OPERATING SYSTEMS

This invention relates to operating systems. More particularly, this invention 
relates to systems, methods and computer programs for running multiple operating 
systems concurrently. 

For some computer programs, it is critical that steps in the program are
performed within defined time periods, or at defined times. Examples of such
programs are control programs for operating mobile telephones, or for operating
private branch exchanges (PBXs) or cellular base stations. Typically, the program
must respond to external events or changes of state in a consistent way, at or within a
certain time after the event. This is referred to as operating in "real time".

For many other programs, however, the time taken to execute the program is
not critical. This applies to most common computer programs, including spreadsheet
programs, word processing programs, payroll packages, and general reporting or
analysis programs. On the other hand, whilst the exact time taken by such programs is
not critical, in most cases users would prefer quicker execution where this is possible.

Applications programs interact with the computers on which they run through
operating systems. By using the applications programming interface (API) of the
operating system, the applications program can be written in a portable fashion, so that
it can execute on different computers with different hardware resources. Additionally,
common operating systems such as Linux or Windows provide multi-tasking; in other
words, they allow several programs to operate concurrently. To do so, they provide
scheduling; in other words, they share the usage of the resources of the computer
between the different programs, allocating time to each in accordance with a
scheduling algorithm. Operating systems of this kind are very widely used, but
they generally make no provision for running real time applications, and they therefore
are unsuitable for many control or communications tasks.

For such tasks, therefore, real time operating systems have been developed; one
example is ChorusOS (also known as Chorus) and its derivatives. Chorus is available
as open source software from:
http://www.experimentalstuff.com/Technologies/ChorusO
and Jaluna at
http://www.jaluna.com/

It is described in "ChorusOS Features and Architecture overview", Francois
Armand, Sun Technical Report, August 2001, 222p, available from:
http://www.jaluna.com/developer/papers/COSDESPERF.pdf

These operating systems could also be used to run other types of programs.
However, users understandably wish to be able to run the vast number of "legacy"
programs which are written for general purpose operating systems such as Windows or
Linux, without having to rewrite them to run on a real time operating system.

It would be possible to provide a "dual boot" system, allowing the user to run
either one operating system or the other, but there are many cases where it would be
desirable to be able to run a "legacy" program at the same time as running a real time
program. For example, telecommunications network infrastructure equipment, third
generation mobile phones and other advanced phones, and advanced electronic gaming
equipment may require both realtime applications (e.g. game playing graphics) and
non-realtime applications (e.g. game download).

In US 5903752 and US 5721922, an attempt is made to incorporate a real time
environment into a non real time operating system by providing a real time
multi-tasking kernel in the interrupt handling environment of the non real time
operating system (such as Windows).

One approach which has been widely used is "emulation". Typically, an
emulator program is written, to run under the real time operating system, which
interprets each instruction of a program written for a general purpose operating system,
and performs a corresponding series of instructions under the real time operating
system. However, since one instruction is always replaced by many, emulation places
a heavier load on the computer, and results in slower performance. Similar problems
arise from the approach based on providing a virtual machine (e.g. a Java™ virtual
machine). Examples of virtual machine implementations are EP 1059582, US
5499379, and US 4764864.

A further similar technique is described in US 5995745 (Yodaiken). Yodaiken
describes a system in which a multi tasking real time operating system runs a general
purpose operating system as one of its tasks, pre-empting it as necessary to perform
real time tasks.

Another approach is to run the realtime operating system as a module of the
general purpose operating system, as described in for example EP 0360135 and the
article "Merging real-time processing and UNIX V", (Gosch), ELECTRONICS,
September 1990 p62. In this case, hardware interrupts are selectively masked with the
intention that those concerned with the general purpose operating system should not
pre-empt the realtime operating system.

Another approach is that of ADEOS (Adaptive Domain Environment for
Operating Systems), described in a White Paper at
http://opersys.com/ftp/pub/Adeos/adeos.pdf

ADEOS provides a nanokernel which is intended, amongst other things, for
running multiple operating systems, although it appears only to have been implemented
with Linux. One proposed use of ADEOS was to allow ADEOS to distribute
interrupts to RTAI (Realtime Application Interface for Linux), for which see:
http://www.aero.polimi.it/~rtai/applications/.

EP 1054332 describes a system in which a "switching unit" (which is not
described in sufficient detail for full understanding) runs a realtime and a general
purpose operating system. Hardware interrupts are handled by a common interrupt
handler, and in some embodiments, they are handled by the realtime operating system,
which then generates software interrupts at a lower priority level which are handled by
routines in the secondary operating system.

An object of the present invention is to provide an improved system, method
and computer program for running multiple operating systems simultaneously, even
when the systems are designed for different purposes. In particular, the present
invention aims to allow one of the operating systems (for example, a real time
operating system) to perform without disturbance, and the other (for example, a
general purpose operating system) to perform as well as possible using the remaining
resources of the computer.

Accordingly, in one aspect, the present invention provides a system in which
multiple operating systems are slightly modified and provided with a common program
which schedules between them, in which one of the operating systems (the "primary"
or "critical" operating system) is favoured over another (the "secondary" or
non-critical operating system). Preferably, the invention allocates hardware preferentially
to the critical operating system, and it denies the secondary operating system or
systems access which would interfere with that of the critical operating system.
Preferably, the present invention uses the critical operating system drivers to access
shared resources, even if the access is requested by the secondary operating system.
However, in no sense is the critical operating system "running" the secondary
operating system, as in US 5995745; each system ignores the others running alongside
it and only communicates with the common program (corresponding to a nanokernel
of the prior art) which brokers the access to the drivers of the critical operating system.

Preferably, the secondary operating systems are modified so that they cannot
mask interrupts, and their interrupt service routines are modified to make them
responsive to messages indicating that an interrupt occurred. The common program
handles all hardware exceptions by passing them to the interrupt service routines of the
primary operating system, and where a hardware interrupt was intended for one of the
secondary operating systems, an interrupt message or notification is generated. Next
time that secondary operating system is scheduled by the common program, the
message or notification is passed to it, and the common program calls its interrupt
service routine to service the interrupt.

Thus, the secondary operating systems cannot pre-empt the primary operating
system (or, in general, a higher importance secondary operating system) in any way on
occurrence of an interrupt, since all are initially handled by the primary operating
system and only notified to the secondary operating system for which they are destined
after the primary operating system has finished execution and that secondary operating
system is scheduled.

Handling of such interrupts is thus deferred until no critical task in the primary
operating system is occurring. When they are eventually actioned, however, the
routines of the secondary operating system may operate in substantially unmodified
fashion so that the behaviour is (except for the delay) as expected by the secondary
operating system.

Such a system is described in our earlier-filed PCT application
PCT/EP04/00371, incorporated herein by reference.
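
By way of illustration only, the deferred interrupt handling described above may be modelled as follows. This is a minimal sketch in Python and not the implementation of the embodiment; the class name, the method names and the interrupt numbers are hypothetical, and real interrupt service routines are replaced by entries in a log.

```python
from collections import defaultdict, deque


class CommonProgram:
    """Toy model of the common program; all names here are hypothetical."""

    def __init__(self, owner_of_irq):
        self.owner_of_irq = owner_of_irq   # irq number -> owning OS name
        self.pending = defaultdict(deque)  # OS name -> queued notifications
        self.log = []                      # (handler, irq) in execution order

    def hardware_interrupt(self, irq):
        # Every hardware interrupt is passed to the primary OS first.
        self.log.append(("primary-isr", irq))
        owner = self.owner_of_irq[irq]
        if owner != "primary":
            # Destined for a secondary OS: only record a notification now.
            self.pending[owner].append(irq)

    def schedule_secondary(self, name):
        # Invoked once the primary OS has nothing left to run.
        while self.pending[name]:
            irq = self.pending[name].popleft()
            self.log.append((name + "-virtual-isr", irq))


cp = CommonProgram({1: "primary", 7: "linux"})
cp.hardware_interrupt(7)   # interrupt destined for the secondary OS
cp.hardware_interrupt(1)   # interrupt destined for the primary OS
cp.schedule_secondary("linux")
```

In this model the secondary operating system's routine for interrupt 7 runs only after both interrupts have passed through the primary operating system, which is the ordering the text describes.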

This invention relates to an implementation on a Complex Instruction Set
Computer (CISC) such as one based on the Intel IA-32 architecture. CISC processors
have multiple registers, the states of which need to be saved and retrieved on switching
between operating systems. They may have multiple memory addressing modes, so
that different operating systems may be running applications in different modes, and
they may have sophisticated data structures the states of which need to be saved and
retrieved. Such factors make it non-trivial to implement a system whereby multiple
operating systems can execute concurrently and in a stable fashion.

For a background understanding of the well-known Intel IA-32 architecture,
the following are incorporated by reference:

Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture
(Order Number 245470-011)

Intel Architecture Software Developer's Manual, Volume 2: Instruction Set
Reference (Order Number 245471-011)

Intel Architecture Software Developer's Manual, Volume 3: System
Programming Guide (Order Number 245472-011)

All are available free of charge from Intel Corporation, PO Box 7641, Mt
Prospect IL 60056-7641 and can be downloaded from http://www.intel.com

Without limitation, some of the innovative features which are disclosed herein
are as follows:

A - The adaptation of the operating systems to replace processor instructions
by calls to methods, which use the resources of the hardware resource despatcher; and
particularly but not exclusively;

B - substitution of instructions which read or write memory addressing data
structures such as vector tables (such as the GDT and IDT tables), allowing;

C - replication of the original memory addressing data structures (such as the
GDT and IDT tables) to provide proxy data structures for operating systems such as
Linux, called by the methods instead of accessed through the replaced processor calls,
leaving the original memory addressing data structures for access by the hardware
resource despatcher;

D - provision of "public" or "open" parts and "private" or "hidden" parts
within the data structures used by the hardware resource despatcher;

E - "lazy" transfer of the Floating Point Unit (FPU) between operating
systems, as well as (as is known) between applications on a single operating system;

F - Use of task gates to switch cleanly between operating systems, and

G - Use of some specific routines rather than task gates to switch rapidly
between the primary operating system and the hardware resource despatcher without
changing memory context (so that not all registers need to be saved, making the switch
faster);

H - Preventing the secondary operating system from masking interrupts,
except, preferably;

I - during task switching operations (which are of very short duration) and/or;

J - during hardware resource despatcher operations which are of very long
duration (such as RS-232 communications or character output);

K - use of two separate stack structures for context switching; one for traps and
one for asynchronous tasks;

L - use of part of the supervisory context of the primary operating system to
run the hardware resource despatcher.

Other aspects, embodiments and preferred features, with corresponding
advantages, will be apparent from the following description, claims and drawings.

Embodiments of the invention will now be described, by way of example only,
with reference to the accompanying drawings, in which:

Figure 1 is a block diagram showing the elements of a computer system on
which the present invention can execute;

Figure 2a is a diagram illustrating the arrangement of software in the prior art;
and

Figure 2b is the corresponding diagram illustrating the arrangement of software
according to the present embodiment;

Figure 3 is a flow diagram showing the stages in creating the software of
Figure 2b for the computer of Figure 1;

Figure 4 shows the components of a hardware resource dispatcher forming part
of Figure 2b;

Figure 5 illustrates the program used in a boot and initialisation sequence;

Figure 6 illustrates the system memory image used in the boot or initialisation
process;

Figure 7 illustrates the transition from a primary operating system to a
secondary operating system;

Figure 8 illustrates the transition from a secondary operating system to a
primary operating system;

Figure 9a illustrates the communication between applications running on
different operating systems according to the invention;

Figure 9b illustrates the communication between applications running on
different operating systems on different computers according to the invention;

Figure 10 shows an example of the primary, secondary and nanokernel virtual
address spaces;

Figure 11 shows how the memory context is switching in time;

Figure 12 illustrates the visible part of the nanokernel context;

Figure 13 illustrates the hidden part of the nanokernel context;

Figure 14 shows how an initial TSS is initialized prior to the task switching;

Figure 15 shows non zero fields of a nanokernel TSS;

Figure 16 shows typical states of a TSS stack;

Figure 17 shows how segmentation and paging are used in memory addressing
in the Intel architecture; and

Figure 18 shows the system-level registers and data structures in the Intel
architecture.

Introduction
System Hardware

A computer system 100 to which the system is applicable comprises a central
processing unit (CPU) 102, such as a Pentium 4™ CPU available from Intel
Corporation, or PowerPC CPU available from Motorola (the embodiment has been
implemented on both), coupled via a system bus 104 (comprising control, data and
address buses) to a read-only memory (ROM) chip 106; one or more banks of random
access memory (RAM) chips 108; disk controller devices 110 (for example IDE or
SCSI controllers, connected to a floppy disk drive, a hard disk drive, and additional
removable media drives such as DVD drives); one or more input/output ports 112
(for example, one or more USB port controllers, and/or parallel port controllers for
connection to a printer and so on); an expansion bus 114 for bus connection to external
or internal peripheral devices (for example the PCI bus); and other system chips 116
(for example, graphics and sound devices). Examples of computers of this type are
personal computers (PCs) and workstations. However, the application of the invention
to other computing devices such as mainframes, embedded microcomputers in control
systems, and PDAs (in which case some of the indicated devices such as disk drive
controllers may be absent) is also disclosed herein.

Management of Software 

Referring to Figure 2a, in use, the computer 100 of Figure 1 runs resident
programs comprising operating system kernel 202 (which provides the output routines
allowing access by the CPU to the other devices shown in Figure 1); an operating
system user interface or presentation layer 204 (such as X Windows); a middleware
layer 206 (providing networking software and protocols such as, for instance, a TCP/IP
stack) and applications 208a, 208b, which run by making calls to the API routines
forming the operating system kernel 202.

The operating system kernel has a number of tasks, in particular:

■ scheduling (i.e., sharing the CPU and associated resources between different
applications which are running);

■ memory management (i.e. allocating memory to each task, and, where
necessary, swapping data and programs out of memory and on to disk drives);

■ providing a file system;

■ providing access to devices (typically, through drivers);

■ interrupt handling;

■ providing an applications programming interface enabling the applications to
interact with system resources and users.

The kernel may be a so-called "monolithic kernel" as for Unix, in which case
the device drivers form part of the kernel itself. Alternatively, it may be a
"microkernel" as for Chorus, in which case the device drivers are separate from the
kernel.

In use, then, when the computer 100 is started, a bootstrap program stored in
ROM 106 accesses the disk controllers 110 to read the file handling part of the
operating system from permanent storage on disk into RAM 108, then loads the
remainder of the operating system into an area of RAM 108. The operating system
then reads any applications from the disk drives via the disk controllers 110, allocates
space in RAM 108 for each, and stores each application in its allocated memory space.

During operation of the applications, the scheduler part of the operating system
divides the use of the CPU between the different applications, allowing each a share of
the time on the processor according to a scheduling policy. It also manages use of the
memory resources, by "swapping out" infrequently used applications or data (i.e.
removing them from RAM 108 to free up space, and storing them on disk).

Finally, the routines making up the applications programming interface (API)
are called from the applications, to execute functions such as input and output, and the
interrupt handling routines of the operating system respond to interrupts and events.

Summary of Principles of the Preferred Embodiment 

In the preferred embodiment, each operating system 201, 202 to be used on the
computer 100 is slightly re-written, and a new low-level program 400 (termed here the
"hardware resource dispatcher", and sometimes known as a "nanokernel" although it is
not the kernel of an operating system) is created. The hardware resource dispatcher
400 is specific to the particular type of CPU 102, since it interacts with the processor.
The versions of the operating systems which are modified 201, 202 are also those
which are specific to the hardware, for reasons which will become apparent.

The hardware resource dispatcher 400 is not itself an operating system. It does
not interact with the applications programs at all, and has very limited functionality.
Nor is it a virtual machine or emulator; it requires the operating systems to be modified
in order to cooperate, even though it leaves most of the processing to the operating
systems themselves, running their native code on the processor.

It performs the following basic functions:

■ loading and starting each of the multiple operating systems;

■ allocating memory and other system resources to each of the operating systems;

■ scheduling the operation of the different operating systems (i.e. dividing CPU
time between them, and managing the change over between them);

■ providing a "virtualised device" method of indirect access to those system
devices which need to be shared by the operating systems ("virtualising" the
devices);

■ providing a communications link between the operating systems, to allow
applications running on different operating systems to communicate with each
other.

The operating systems are not treated equally by the embodiment. Instead, one
of the operating systems is selected as the "critical" operating system (this will be the
real time operating system), and the or each other operating system is treated as a "non
critical" or "secondary" operating system (this will be the or each general purpose
operating system such as Linux).

When the hardware resource dispatcher is designed, it is provided with a data
structure (e.g. a table) listing the available system resources (i.e. devices and memory),
to enable as many system devices as possible to be statically allocated exclusively to
one or other of the operating systems.

For example, a parallel printer port might be statically allocated to the general
purpose operating system 202, which will often run applications which will need to
produce printer output. On the other hand, an ISDN digital line adapter port may be
permanently allocated to the real time operating system 201 for communications. This
static allocation of devices wherever possible means that each operating system can
use its existing drivers to access statically allocated devices without needing to call the
hardware resource dispatcher. Thus, there is no loss in execution speed in accessing
such devices (as there would be if it acted as a virtual machine or emulator).

In the case of system devices which must be shared, the hardware resource
dispatcher virtualises uses of the devices by the non-critical operating systems, and
makes use of the drivers supplied with the critical operating system to perform the
access. Likewise, for interrupt handling, the interrupts pass to the critical operating
system interrupt handling routines, which either deal with the interrupt (if it was
intended for the critical operating system) or pass it back through the hardware
resource dispatcher for forwarding to a non critical operating system (if that was where
it was destined).
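
The distinction between statically allocated and shared devices may be illustrated by a small lookup, sketched here in Python. The table contents, the device names and the function `access_path` are assumptions made for illustration and do not reproduce the actual table of the embodiment.

```python
# Hypothetical allocation table: device name -> owner.
# "shared" marks a device that must be virtualised for secondary systems.
ALLOCATION = {
    "parallel_port": "secondary",  # e.g. printer output for the general purpose OS
    "isdn_adapter": "critical",    # e.g. communications for the real time OS
    "system_timer": "shared",
}


def access_path(os_name, device, table=ALLOCATION):
    """Return how os_name would reach device under the assumed table."""
    owner = table[device]
    if owner == os_name:
        # Statically allocated to the caller: its existing driver is used.
        return "native driver"
    if owner == "shared":
        # The critical OS drives shared devices itself; the others are relayed.
        return "native driver" if os_name == "critical" else "via dispatcher"
    raise PermissionError(f"{device} is allocated to {owner}")
```

For instance, `access_path("secondary", "parallel_port")` yields the native driver path, whereas `access_path("secondary", "system_timer")` is routed through the dispatcher.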

On boot, the hardware resource dispatcher is first loaded, and it then loads each
of the operating systems in a predetermined sequence, starting with the critical
operating system, then following with the or each secondary operating system in turn.
The critical operating system is allocated the resources it requires from the table, and
has a fixed memory space to operate in. Then each secondary operating system in turn
is allocated the resources and memory space it requires from the available remaining
resources.
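
The order of allocation at boot can be sketched as below. This Python fragment is an assumption-laden illustration: the function `boot_allocate`, the units of memory and the device names are hypothetical, and memory is modelled as a single contiguous range handed out in sequence.

```python
def boot_allocate(memory_kb, devices, requests):
    """Allocate memory and devices in boot order (critical OS listed first).

    requests is a list of (name, memory_needed_kb, wanted_devices) tuples.
    """
    free_devices = set(devices)
    next_base = 0
    plan = {}
    for name, needed, wanted in requests:
        if needed > memory_kb - next_base:
            raise MemoryError(f"not enough memory left for {name}")
        # An OS only receives devices that earlier systems did not take.
        granted = sorted(set(wanted) & free_devices)
        free_devices -= set(granted)
        plan[name] = {"base": next_base, "size": needed, "devices": granted}
        next_base += needed
    return plan


plan = boot_allocate(
    1024,
    ["isdn", "lpt", "uart"],
    [("critical", 256, ["isdn", "uart"]), ("linux", 512, ["lpt", "uart"])],
)
```

Because the critical operating system is served first, the contested "uart" device goes to it, and the secondary system is left with "lpt" only.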

Thus, according to the embodiment, the resources used by the operating
systems are separated as much as physically possible, by allocating each its own
memory space, and by providing a static allocation of devices exclusively to the
operating systems; only devices for which sharing is essential are shared.

In operation, the hardware resource dispatcher scheduler allows the critical
operating system to operate until it has concluded its tasks, and then passes control
back to each non critical operating system in turn, until the next interrupt or event
occurs.
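
The scheduling policy just described may be sketched as a simple simulation. The following Python is illustrative only: time is divided into abstract ticks, the secondary systems are rotated one tick at a time (an assumption, since the text does not fix the granularity), and the function name `run` and the system names are hypothetical.

```python
from collections import deque


def run(critical_work, secondaries, events, ticks):
    """Return which OS owns the CPU on each tick.

    critical_work: ticks of critical work pending at the start.
    events: dict mapping a tick to the critical work an interrupt adds then.
    """
    trace = []
    ring = deque(secondaries)
    pending = critical_work
    for t in range(ticks):
        pending += events.get(t, 0)   # an interrupt may create critical work
        if pending > 0:
            trace.append("critical")  # critical OS runs until it is done
            pending -= 1
        else:
            trace.append(ring[0])     # otherwise the secondaries take turns
            ring.rotate(-1)
    return trace


trace = run(2, ["secondary_a", "secondary_b"], {4: 1}, 6)
```

In this run the critical system occupies the first two ticks, the secondary systems alternate while it is idle, and the interrupt at tick 4 returns the processor to the critical system at once.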

The embodiment thus allows a multi operating system environment in which
the operation of the critical operating system is virtually unchanged (since it uses its
original drivers, and has first access to any interrupt and event handling). The
secondary operating systems are able to operate efficiently, within the remaining
processor time, since in most cases they will be using their own native drivers, and will
have exclusive access to many of the system devices. Finally, the hardware resource
dispatcher itself can be a small program, since it handles only limited functions, so that
system resources are conserved.

The preferred embodiment is also economic to create and maintain, because it
involves only limited changes to standard commercial operating systems which will
already have been adapted to the particular computer 100. Further, since the changes
to the operating systems are confined to architecture specific files handling matters
such as interrupt handling, and configuration at initialising time, which interface with
the particular type of computer 100, and which are unlikely to change as frequently as
the rest of the operating system, there may be little or no work to do in adapting new
versions of the same operating system to work in a multiple operating system fashion.

Detailed Description of the Preferred Embodiment

In this embodiment, the computer 100 has an Intel 386 family processor (e.g. a
Pentium processor) (step 302). The critical operating system 201 was the C5 operating
system (the real time microkernel of Jaluna-1, an open-source version of the fifth
generation of the ChorusOS system, available for open source, free download from
http://www.jaluna.com).

In step 306, the ChorusOS operating system kernel 201 is modified for
operating in multiple operating system mode, which is treated in the same way as
porting to a new platform (i.e. writing a new Board Support Package to allow
execution on a new computer with the same CPU but different system devices). The
booting and initialisation sequences are modified to allow the real time operating
system to be started by the hardware resource dispatcher, in its allocated memory
space, rather than starting itself. The hardware-probing stage of the initialisation
sequence is modified, to prevent the critical operating system from accessing the
hardware resources which are assigned to other secondary systems. It reads the static
hardware allocation table from the hardware resource dispatcher to detect the devices
available to it.

Trap calls 2012 are added to the critical operating system, to detect states and
request some actions in response. A trap call here means a call which causes the
processor to save the current context (e.g. state of registers) and load a new context.
Thus, where virtual memory addressing is used, the address pointers are changed.
For example, when the real time operating system 201 reaches an end point (and
ceases to require processor resources) control can be passed back to the hardware
resource dispatcher, issuing the "idle" trap call, to start the secondary operating
system. Many processors have a "halt" instruction. In some cases, only
supervisor-level code (e.g. operating systems, not applications) can include such a
"halt" instruction. In this embodiment, all the operating systems are rewritten to
remove "halt" instructions and replace them with an "idle" routine (e.g. an execution
thread) which, when called, issues the "idle" trap call.
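
The effect of the "idle" trap call on the processor context can be modelled as follows. The sketch is in Python and is a deliberate simplification: the register file is a dictionary holding two made-up registers, and the names `Dispatcher`, `trap_idle` and `idle_routine` are hypothetical rather than taken from the embodiment.

```python
class Dispatcher:
    """Toy model of a trap that saves one context and loads another."""

    def __init__(self, cpu_regs, order, saved):
        self.cpu_regs = cpu_regs   # live register file, modelled as a dict
        self.order = list(order)   # operating systems, critical OS first
        self.saved = saved         # OS name -> last saved context
        self.current = self.order[0]

    def trap_idle(self):
        # Save the context of the system that issued the trap.
        self.saved[self.current] = dict(self.cpu_regs)
        # Select the next system and load its previously saved context.
        i = self.order.index(self.current)
        self.current = self.order[(i + 1) % len(self.order)]
        self.cpu_regs.clear()
        self.cpu_regs.update(self.saved[self.current])


def idle_routine(dispatcher):
    # Stands in for the removed "halt" instruction: yield instead of halting.
    dispatcher.trap_idle()


regs = {"pc": 0x1000, "sp": 0x8000}
saved = {"critical": {}, "linux": {"pc": 0x4000, "sp": 0xC000}}
d = Dispatcher(regs, ["critical", "linux"], saved)
idle_routine(d)
```

After the call, the critical system's registers are held in `saved` and the live register file contains the values last stored for the secondary system.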

Some drivers of the Board Support Package are specially adapted to assist the
hardware resource dispatcher in virtualizing the shared devices for secondary operating
systems.

Additional "virtual" drivers 2014 are added which, to the operating system,
appear to provide access to an input/output (I/O) bus, allowing data to be written to the
bus. In fact, the virtual bus driver 2014 uses memory as a communications medium; it
exports some private memory (for input data) and imports memory exported by other
systems (for output data). In this way, the operating system 201 (or an application
running on the operating system) can pass data to another operating system (or
application running on it) as if they were two operating systems running on separate
machines connected by a real I/O bus.
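
The use of exported and imported memory as a bus may be sketched as below. This Python model is only indicative: shared memory is represented by a `bytearray` referenced by both endpoints, and the class `VirtualBusEndpoint` with its `connect`, `write` and `read` methods is a hypothetical interface, not that of driver 2014.

```python
class VirtualBusEndpoint:
    """One operating system's end of a memory-backed virtual bus."""

    def __init__(self, name):
        self.name = name
        self.inbox = bytearray()   # private memory exported for input data
        self.peer_inbox = None     # memory imported from the peer, for output

    def connect(self, other):
        # Each side imports the region that the other side exports.
        self.peer_inbox = other.inbox
        other.peer_inbox = self.inbox

    def write(self, data):
        # "Writing to the bus" is a store into the peer's exported memory.
        self.peer_inbox.extend(data)

    def read(self):
        data = bytes(self.inbox)
        self.inbox.clear()         # cleared in place, so the peer's reference stays valid
        return data


rtos = VirtualBusEndpoint("rtos")
linux = VirtualBusEndpoint("linux")
rtos.connect(linux)
rtos.write(b"ping")
received = linux.read()
```

Neither endpoint needs to know that the other runs on the same machine; each simply reads from and writes to what appears to be a bus.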

The secondary operating system 202 was selected (step 308) as Linux, having a
kernel version 2.4.18.

In step 310, the secondary operating system kernel 202 is modified to allow it
to function in a multiple operating system environment, which is treated as a new
hardware architecture. As in step 306, the boot and initialisation sequences are
modified, to allow the secondary operating system to be started by the hardware
resource dispatcher, and to prevent it from accessing the hardware resources assigned
to the other systems, as specified in the hardware resource dispatcher table. As in step
306, trap calls 2022 are added, to pass control to the hardware resource dispatcher.

Native drivers for shared system devices are replaced by new drivers 2028
dealing with devices which have been virtualized by the hardware resource dispatcher
(interrupt controller, I/O bus bridges, the system timer and the real time clock). These
drivers execute a call to virtual device handlers 416 of the hardware resource
dispatcher in order to perform some operations on a respective device of the computer
100. Each such virtual device handler 416 of the hardware resource dispatcher is
paired with a "peer" driver routine in the critical operating system, which is arranged
to directly interact with the system device. Thus, a call to a virtual device handler is
relayed up to a peer driver in the critical system for that virtualized device, in order to
make real device access. As in step 306, read and write drivers 2024 for the virtual
I/O bus are provided, to allow inter-operating system communications.
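
The relay from a secondary driver, through a virtual device handler, to a peer driver may be pictured with the following Python sketch. All three classes and their methods are hypothetical stand-ins, and the device is reduced to a single register so that the path of each call is easy to follow.

```python
class PeerDriver:
    """Driver in the critical OS: the only code that touches the device."""

    def __init__(self):
        self.device_register = 0   # stands in for real hardware state

    def write(self, value):
        self.device_register = value

    def read(self):
        return self.device_register


class VirtualDeviceHandler:
    """Dispatcher-side handler paired with one peer driver."""

    def __init__(self, peer):
        self.peer = peer
        self.calls = 0

    def call(self, op, value=None):
        self.calls += 1
        if op == "write":
            return self.peer.write(value)
        if op == "read":
            return self.peer.read()
        raise ValueError(f"unsupported operation: {op}")


class SecondaryTimerDriver:
    """Replacement driver in the secondary OS; it never touches hardware."""

    def __init__(self, handler):
        self.handler = handler

    def set_timer(self, ticks):
        self.handler.call("write", ticks)

    def get_timer(self):
        return self.handler.call("read")


peer = PeerDriver()
handler = VirtualDeviceHandler(peer)
driver = SecondaryTimerDriver(handler)
driver.set_timer(100)
```

The secondary driver's request reaches the device only through the handler and the peer driver, so real device access stays within the critical operating system.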

WO 2005/031572    PCT/IB2004/003334

The interrupt service routines of the secondary operating system are modified,
to provide virtual interrupt service routines 2026 each of which responds to a
respective virtual interrupt (in the form of a call issued by an interrupt handler routine
412 of the hardware resource dispatcher), and not to respond to real interrupts or
events. Routines of the secondary operating system (including interrupt service
routines) are also modified to remove masking of hardware interrupts (at least in all
except critical operations). In that way, the secondary operating systems 202, ... are
pre-emptable by the critical operating system 201; in other words, the
secondary operating system response to a virtual interrupt can itself be interrupted by
a real interrupt for the critical operating system 201. This typically includes:

■ masking/unmasking events (interrupts at processor level);

■ saving/restoring events mask status;

■ identifying the interrupt source (interrupt controller devices);

■ masking/unmasking interrupts at source level (interrupt controller devices).

New virtual device drivers 2028 are added, for accessing the shared hardware
devices (the I/O bus bridges, the system console, the system timer and the real time
clock). These drivers execute a call to virtual device handlers 416 of the hardware
resource dispatcher in order to write data to, or read data from, a respective device of
the computer 100.

To effect this, the Linux kernel 207 is modified in this embodiment by adding
new virtual hardware resource dispatcher architecture sub trees (nk-i386 and nk-ppc
for the I-386 and PowerPC variants) with a small number of modified files.
Unchanged files are reused in their existing form. The original sub-trees are retained,
but not used.

In step 312, the hardware resource dispatcher 400 is written. The hardware
resource dispatcher comprises code which provides routines for the following
functions (as shown in Figure 4):

■ booting and initialising itself (402);

■ storing a table (403) which stores a list of hardware resources (devices such as
ports) and an allocation entry indicating to which operating system each
resource is uniquely assigned;

■ booting and initialising the critical operating system that completes the
hardware resource dispatcher allocation tables (404);

■ booting and initialising secondary operating systems (406);

■ switching between operating systems (408);

■ scheduling between operating systems (410);

■ handling interrupts (using the real time operating system interrupt service
routines, and supplying data where necessary to the virtual interrupt service
routines of the secondary operating systems) (412);

■ handling trap calls from each of the operating systems (414);

■ handling access to shared devices from the secondary operating systems (416);

■ handling inter-operating system communications on the virtual I/O bus (418).

In further embodiments (described below), it may also provide a system debugging
framework.

Operating system switcher 408 

In order to switch from one operating system to another, the operating system
switcher 408 is arranged to save the "context" - the current values of the set of state
variables, such as register values - of the currently executing operating system; restore
the stored context of another operating system; and call that other operating system to
recommence execution where it left off. Where the processor uses segments of
memory, and virtual or indirect addressing techniques, the registers or data structures
storing the pointers to the current memory spaces are thus swapped. For example, the
operating systems each operate in different such memory spaces, defined by the
context including the pointer values to those spaces.

In detail, the switcher provides:
• explicit switches (e.g. trap calls) from the currently running operating system to
the next scheduled operating system, when the current one becomes idle; and
• implicit switches from a secondary operating system to the critical operating
system, when a hardware interrupt occurs.

The switches may occur on a trap call or a real or virtual interrupt, as described 
below. 
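
By way of illustration only, the save/restore sequence performed by the switcher can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the `Context` fields, the `Cpu` stand-in and the function name `switch_os` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Saved state of one operating system (hypothetical field set)."""
    program_counter: int = 0
    stack_pointer: int = 0
    memory_space: int = 0  # pointer to the OS's current memory space


@dataclass
class Cpu:
    """Stand-in for the live processor registers."""
    program_counter: int = 0
    stack_pointer: int = 0
    memory_space: int = 0


def switch_os(cpu, saved, current_os, next_os):
    """Save the context of current_os, then restore that of next_os."""
    # Save the state variables of the currently executing operating system.
    saved[current_os] = Context(cpu.program_counter, cpu.stack_pointer,
                                cpu.memory_space)
    # Restore the stored context of the other operating system, including
    # the pointer to its memory space, so that it resumes where it left off.
    ctx = saved[next_os]
    cpu.program_counter = ctx.program_counter
    cpu.stack_pointer = ctx.stack_pointer
    cpu.memory_space = ctx.memory_space
    return next_os
```

Swapping the memory-space pointer together with the register values is what keeps each operating system confined to its own memory space after the switch.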

Scheduler 410

The scheduler 410 allocates each operating system some of the available
processing time, by selecting which secondary operating system (if more than one is
present) will be switched to next, after exiting another operating system. In this
embodiment, each is selected based on fixed priority scheduling. Other embodiments
allowing specification based on time sharing, or guaranteed minimum percentage of
processor time, are also contemplated herein. In each case, however, the critical
operating system is pre-empted only when in the idle state.
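
The fixed priority selection rule can be sketched as follows. This is an illustrative sketch only: the function name `select_next_os`, the tuple layout and the convention that a larger number means a higher priority are assumptions made for the example and are not taken from the embodiment.

```python
def select_next_os(critical_idle, secondaries):
    """Pick the operating system to run next.

    critical_idle: True when the critical OS is in its idle state.
    secondaries: list of (name, priority, ready) tuples; a larger
    priority number means more important (an assumption of this sketch).
    """
    # The critical operating system is pre-empted only when idle.
    if not critical_idle:
        return "critical"
    ready = [(prio, name) for name, prio, is_ready in secondaries if is_ready]
    if not ready:
        # Nothing else wants the processor: stay in the critical idle thread.
        return "critical"
    # Fixed priority: the highest-priority ready secondary always wins.
    return max(ready)[1]
```

A time sharing or guaranteed-percentage policy would replace only the final selection step; the rule that the critical operating system yields only when idle would remain unchanged.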

In further embodiments, the critical operating system may explicitly inform the
scheduler 410 when it may be pre-empted, so as to allow all secondary operating
systems some access to the CPU to perform tasks with higher priority than the tasks
still running in the critical system. Thus, in one example, the interrupt service routines of
the critical operating system cannot be pre-empted, so that the critical operating system
can always respond to external events or timing signals from the realtime clock,
maintaining realtime operation.

Handling virtualised processor exceptions 

The hardware resource dispatcher is arranged to provide mechanisms to handle
processor exceptions (e.g. CPU interrupts or co-processor interrupts) as follows:

• firstly, to intercept processor exceptions through the critical operating system;

• secondly, to post a corresponding virtual exception to one or more secondary
operating systems; to store that data and, when the scheduler next calls that
secondary operating system, to call the corresponding virtual interrupt service
routine 2026 in the secondary operating system;

• thirdly, to mask or unmask any pending virtual exceptions from within
secondary operating systems.

Virtualised exceptions are typically used for two different purposes:

• Firstly, to forward hardware device interrupts (which are delivered as
asynchronous processor exceptions) to secondary operating systems;

• Secondly, to implement inter-operating system cross-interrupts - i.e. interrupts
generated by one system for another (which are delivered as
synchronous exceptions).
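
The post, mask and deliver mechanism described in this section can be sketched as a small pending-exception queue kept for each secondary operating system. The class name `VirtualExceptions` and its method names are hypothetical; the sketch only illustrates the ordering of events and makes no claim about how the embodiment stores this data.

```python
class VirtualExceptions:
    """Pending virtual exceptions for one secondary operating system."""

    def __init__(self, handlers):
        self.handlers = handlers  # exception number -> virtual ISR
        self.pending = []         # posted but not yet delivered
        self.masked = set()       # exception numbers currently masked

    def post(self, number):
        """Called by the dispatcher after intercepting a real exception."""
        if number not in self.pending:
            self.pending.append(number)

    def mask(self, number):
        self.masked.add(number)

    def unmask(self, number):
        self.masked.discard(number)

    def deliver(self):
        """Run virtual ISRs for unmasked pending exceptions, in post order."""
        delivered = []
        still_pending = []
        for number in self.pending:
            if number in self.masked:
                still_pending.append(number)  # stays pending while masked
            else:
                self.handlers[number](number)
                delivered.append(number)
        self.pending = still_pending
        return delivered
```

In this model `deliver` would be invoked when the scheduler next selects the secondary operating system, so a masked exception is retained rather than lost.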

Trap call handler 414 

The operation of the trap call handler will become apparent from the following
description. Its primary purpose is to allow the scheduler and switcher to change to
another operating system when a first one halts (and hence does not require CPU
resources). An additional role is to invoke hardware resource dispatcher services such
as a system console for use in debugging as discussed in relation to later embodiments.

Virtualised devices 416

As indicated above, for each shared device (e.g. interrupt controller, bus 
bridges, system timer, realtime clock) each operating system provides a device driver, 
forming a set of peer-level drivers for that device. The realtime operating system
provides the driver used to actually access the device, and the others provide virtual 
device drivers. 

The shared device handler 416 of the hardware resource dispatcher provides a 
stored data structure for each device, for access by all peer device drivers of that 
device. When the device is to be accessed, or has been accessed, the device drivers 
update the data stored in the corresponding data structure with the details of the access. 
The peer drivers use cross-interrupts (as discussed above) to signal an event to notify
other peer drivers that the data structure has just been updated.

The drivers which are for accessing interrupt controller devices use the
virtualised exception mechanisms discussed above to handle hardware interrupts as
follows:

• The critical operating system device driver handles hardware interrupts and
forwards them as virtualised exceptions to the secondary peer drivers;

• The secondary operating system enables and disables interrupts by using the
virtualised exception masking and unmasking routines discussed above.

I/O buses and their bridges only have to be shared if the devices connected to
them are not all allocated to the same operating system. Thus, in allocating devices, to
the extent possible, devices connected to the same I/O bus are allocated to the same
operating system. Where sharing is necessary, the resource allocation table 404 stores
descriptor data indicating the allocation of the resources on the bus (address spaces,
interrupt lines and I/O ports) to indicate which operating system has which resources.

Implementation of the embodiment

Finally, in step 314, the code for the hardware resource dispatcher and
operating systems is compiled as a distributable binary computer program product for
supply with the computer 100.

A product which may be supplied in accordance with an aspect of the invention
is a development environment product, comprising a computer program which enables
the user to select different operating systems to be used, build and select different
applications for each operating system, embed the application and operating systems
into a deliverable product, and provide for booting of the operating system and launch
of executable binaries of the applications. This is based on, and similar to, the C5
development environment, available from www.jaluna.com.

Operation of the Embodiment During Booting and Initialisation 

Referring to Figure 5, the boot and initialisation processes according to this
embodiment are performed as follows:

A bootstrapping program ("trampoline") 4022 stored in the ROM 106 is
executed when power is first supplied, which starts a program 4024 which installs the
rest of the hardware resource dispatcher program 400 into memory, and starts it,
passing as an argument a data structure (as described below) describing the system
image configuration.

The hardware resource dispatcher initialises a serial line which may be used for
a system console. It then allocates memory space (an operating system environment)
for each operating system in turn, starting with the critical operating system. The
hardware resource dispatcher therefore acts as a second level system kernel boot
loader.

Each operating system kernel then goes through its own initialisation phase,
selecting the resources to be exclusive to that operating system within those remaining
in the resource allocation table 404, and starting its initial services and applications. 

Figure 6 illustrates an example of a memory address allocation forming the
system image. A position within memory is allocated when the hardware resource
dispatcher and operating systems are compiled. The set of these positions in memory
defines the system image, shown in Figure 6. The system image comprises a first bank
of memory 602 where the hardware resource dispatcher is located; a second bank of
memory 604 where the real time operating system is located; a third bank of memory
606 where the secondary operating system is located; and, in this embodiment, a fourth
bank of memory 608 where the RAM disk containing a root file system of the
secondary operating system (Linux) is located.

This system image is stored in persistent storage (e.g. read only memory for a 
typical real time device such as a mobile telephone or PBX). The remaining banks of 
memory are available to be allocated to each operating system as its environment,
within which it can load and run applications.

Allocation of Memory for Operating System Context 

Whilst being booted, each operating system then allocates a complementary 
piece of memory in order to meet the total size required by its own configuration. 
Once allocated to an operating system, banks of memory are managed using the
physical memory management scheme of the operating system itself. All other 
memory is ignored by the operating system. 

Virtual Memory Allocation 

Each operating system is allocated separate virtual memory spaces, to make
sure that operating systems cannot interfere with each other or with the hardware
resource dispatcher. The User address spaces (i.e. ranges) and Supervisor address
space (i.e. range) of each of the operating systems are each allocated a different
memory management unit (MMU) context identifier (ID), which allows the
differentiation of different virtual memory spaces having overlapping addresses. The
MMU context IDs are assigned to each operating system at the time it is compiled
(step 314 of Figure 3).

This solution avoids the need to flush translation caches (TLBs) when the
hardware resource dispatcher switches between different operating systems, which
would take additional time. Instead, the switch over between different operating
systems is accomplished by storing the MMU context IDs of the currently functioning
operating system, and recalling the previously stored MMU context IDs of the
switched-to operating system.
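
The benefit of context IDs can be seen in a toy model of a translation cache whose entries are tagged with the ID of the context that created them. The class `TaggedTlb` and its methods are hypothetical names used for illustration; a real MMU performs this tagging in hardware.

```python
class TaggedTlb:
    """Toy translation cache whose entries are tagged with a context ID."""

    def __init__(self):
        self.entries = {}       # (context_id, virtual_page) -> physical_page
        self.current_id = None  # models the MMU context ID register

    def switch_context(self, context_id):
        # Only the ID register changes; cached translations are kept.
        self.current_id = context_id

    def insert(self, virtual_page, physical_page):
        self.entries[(self.current_id, virtual_page)] = physical_page

    def lookup(self, virtual_page):
        # Returns None on a miss (a real MMU would walk the page tables).
        return self.entries.get((self.current_id, virtual_page))
```

Because the context ID is part of the lookup key, two operating systems may use the same virtual page number without seeing each other's translation, and a switch costs one register write rather than a flush.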

Allocation of Input/Output Devices 

As indicated above, the allocation table 404 indicates which devices are
allocated uniquely to each operating system. In addition, table 404 indicates which
input/output resources (Direct Memory Access (DMA) devices, input/output ports,
interrupts and so on) are allocated exclusively to such devices, thus allowing a direct
use of these resources without any conflict. Typically, many devices are duplicated, so
it is possible to reduce potential conflicts substantially in this way.

The distribution is based on the operating system configuration scheme (for
example, in the case of C5, the devices specified in the device tree). They are
allocated to operating systems at boot time, and in order of booting, so that the critical
operating system has first choice of the available devices in the table 404 and the
secondary operating systems in turn receive their allocation in what remains. As each
operating system is initialised, it detects the presence of these devices and uses its native
drivers for them without interaction from the hardware resource dispatcher.
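
The boot-order allocation rule can be sketched as follows. The function name `allocate_devices`, the dictionary layout of the table and the device names used in the usage note are assumptions made for the example only.

```python
def allocate_devices(table, boot_order, wanted):
    """Assign each device to the first-booting OS that asks for it.

    table: dict device -> owner (None while unallocated).
    boot_order: OS names, critical operating system first.
    wanted: dict OS name -> list of devices from its configuration.
    """
    for os_name in boot_order:
        for device in wanted.get(os_name, []):
            # Only devices still unallocated in the table can be claimed.
            if device in table and table[device] is None:
                table[device] = os_name
    return table
```

For instance, if the critical operating system and a secondary operating system both list a hypothetical device `eth0`, the critical operating system obtains it because it boots first, and the secondary operating system receives only what remains.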

"Hot" Reboot of Secondary Operating System 

According to the present embodiments, it is possible to reboot a secondary
operating system (for example because of a crash) whilst other operating systems
continue to run. Because of the separation of system resources, a crash in the
secondary operating system does not interfere with the ongoing operation of the
critical operating system (or other secondary operating systems) and the rebooting of
that secondary operating system does not do so either.

In the embodiment, the system "stop" and "start" trap calls to the hardware
resource dispatcher assist in shutting down and restarting the secondary operating
systems from within the critical operating system. Additionally, the hardware resource
dispatcher saves a copy of the original system image, at boot time, in persistent
memory within the hardware resource dispatcher allocated memory. As an example,
hot restart in this embodiment is managed as follows:

At the time of initially booting up, the hardware resource dispatcher saves a
copy of the secondary operating system's memory image.

The critical operating system includes a software watchdog driver routine for
periodically monitoring the functioning of the secondary operating systems (for
example, by setting a timeout and waiting for an event triggered by a peer driver
running in the secondary operating systems so as to check for their continued
operation).

If the critical operating system detects that the secondary operating system has
failed or stopped, it triggers "stop" and then "start" trap calls (of the secondary
operating system) to the hardware resource dispatcher.

The hardware resource dispatcher then restores the saved copy of the secondary
operating system image, and reboots it from memory to restart. It was found that, on
tests of an embodiment, the Linux secondary operating system could be rebooted
within a few seconds from locking up.

In other respects, the hot restart builds upon that available in the Chorus operating
system, as described for example in:

"Fast Error Recovery in CHORUS/OS. The Hot-Restart Technology".
Abrossimov, F. Hermann, J.C. Hugly, et al. Chorus Systems Inc. Technical Report,
August 1996, 14p. available from:
http://www.jaluna.com/developer/papers/CSI-TR-96-34.pdf
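
The hot restart sequence set out above (image saved at boot, watchdog timeout, "stop" then "start" trap calls, image restored) can be sketched as follows. The `Dispatcher` class, its `trap_stop` and `trap_start` methods and the `watchdog_check` function are hypothetical names; the sketch models only the ordering of the steps, not the Chorus hot-restart technology itself.

```python
import copy


class Dispatcher:
    """Keeps a pristine copy of the secondary image for hot restart."""

    def __init__(self, secondary_image):
        self.saved_image = copy.deepcopy(secondary_image)  # saved at boot
        self.image = secondary_image  # live image, may become corrupted
        self.running = True

    def trap_stop(self):
        self.running = False

    def trap_start(self):
        # Restore the saved copy and reboot the secondary from memory.
        self.image = copy.deepcopy(self.saved_image)
        self.running = True


def watchdog_check(dispatcher, heartbeat_seen):
    """Run by the critical OS when its watchdog timeout expires."""
    if heartbeat_seen:
        return False  # secondary is alive, nothing to do
    dispatcher.trap_stop()
    dispatcher.trap_start()
    return True  # a hot restart was performed
```

The critical operating system is never stopped in this sequence; only the secondary image is replaced.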

Run-time Operation 

The operation of the embodiment after installation and booting will now be 
described in greater detail. 

Having been booted and initialised, the real time operating system is running
one or more applications 207 (for example a UDP/IP stack - UDP/IP stands for
User Datagram Protocol/Internet Protocol) and the secondary operating system is
running several applications 208a, 208b (for example a word processor and a
spreadsheet). The real time operating system microkernel 201 and the secondary
operating system kernel 202 communicate with the hardware resource dispatcher
through the hardware resource dispatcher interface which comprises:

• a data structure representing the operating system context (i.e. the set of state
variables which need to be saved and restored in order to switch to the operating
system), and the hardware repository;

• the set of functions which execute in the operating system environment; and

• the set of trap call routines which execute in the hardware resource dispatcher
environment.

If neither operating system requires processor time (for example, both have
reached "wait" states) then the hardware resource dispatcher 400 switches to the
critical operating system's idle thread, in which it awaits an interrupt or event. Thus,
interrupts can be processed immediately by the critical operating system's servicing
routines, without needing to switch to the critical operating system first.

At some point, an interrupt or event will occur. For example, a packet may be
received at a data port, causing an interrupt to allow it to be processed by the real time
operating system executing the UDP/IP stack. Alternatively, a user may manipulate a
keyboard or mouse, causing an interrupt to operate the GUI of the second operating
system 202 for interaction with the word processing application 208. Alternatively,
the system clock may indicate that a predetermined time has elapsed, and that an
application should commence re-execution, or an operating system function should
execute.

The critical operating system servicing routine then services the interrupt, as
described below.

Interrupt and Event Handling 

If not already in the critical operating system, the hardware resource dispatcher
interrupt handler 412 calls the operating system switcher 408 to switch to the critical
operating system, and then calls an interrupt service routine (ISR) in the critical
operating system 201. If the interrupt is intended for the
critical operating system, either because it is from a device uniquely assigned to the
critical operating system or because it is from a shared device and has a certain
predetermined value, the critical operating system ISR takes the action necessary to
handle the interrupt. If not, control is passed back to the hardware resource dispatcher.

Critical to Secondary Operating Systems Switch

Referring to Figure 7, for this example, the system is executing a thread 702 of
an application 207a running on the critical operating system 201.

If an interrupt occurs, a critical operating system interrupt service routine 704
performs interrupt servicing. On termination, control passes back to the thread 702
and any others executed by the scheduler of the critical operating system 201. When
processing of all threads is complete and the critical operating system has finished
executing, it schedules its "idle" thread. Accordingly the "idle" trap routine in the
critical operating system issues an "idle" trap call to the hardware resource dispatcher
400. The hardware resource dispatcher then executes a routine which does the
following:

• If the interrupt handler 412 currently has some stored virtual interrupts, these
are forwarded by the interrupt handler 412 to the secondary operating system.

• The hardware resource dispatcher operating system scheduler 410 selects the
secondary operating system 202 to execute. The OS switcher 408 then saves
the current context (typically, processor MMU and status registers, instruction
and stack pointers) in the critical OS context storage area 706. It then retrieves
the stored execution context 708 for the secondary operating system 202, and
writes it to the registers concerned.

• If there are virtual interrupts for the secondary OS concerned, the interrupt
handler 412 calls the relevant interrupt service routine 710 within the secondary
operating system, which services the interrupt and then, on completion, reverts
to the execution of a thread 712 of the secondary operating system where it left
off.

If the interrupt handler 412 currently has no pending interrupts, then the
hardware resource dispatcher operating system switcher 408 causes the secondary operating
system to recommence execution where it left off, using the stored program counter
value within the restored operating system context, in this case at the thread 712.

Thus, after the critical operating system 201 has performed some function
(either servicing its own applications or services, or servicing an interrupt intended for
another operating system), the hardware resource dispatcher passes control back to the
next secondary operating system 202, as determined by the scheduler 410.

Secondary to Critical Operating System Switch on interrupt 

Referring to Figure 8, the process of transferring from the secondary operating
system to the critical operating system will now be disclosed. In this case, the system
is executing a thread 712 of an application 208a running on the secondary operating
system 202.

When a hardware interrupt occurs, the hardware resource dispatcher starts the
OS switcher, to save the secondary operating system context in the context storage
area 708. It then switches to the primary operating system 201, restoring the values of
state variables from the context storage area 706, and calls the interrupt service routine
704 of the primary operating system 201. After servicing the interrupt, the scheduler
of the primary operating system 201 may pass control back from the ISR 704 to any
thread 702 which was previously executing (or thread to be executed).

When the ISR and all threads are processed, the primary operating system 201
passes control back to the hardware resource dispatcher, which switches from the
primary operating system 201 (saving the state variables in the context storage 706)
and switches to a selected secondary operating system 202 (retrieving the state
variables from the context storage 708), in the manner discussed with reference to
Figure 7 above.

Inter-operating system communications - virtual bus 418

The virtual bus routine cooperates with the virtual bus drivers in each operating
system. It emulates a physical bus connecting the operating systems, similar to
Compact PCI (cPCI) boards plugged into a cPCI backplane. Each operating system is
provided with a driver routine for the virtual bus bridge device on this virtual bus,
allowing the operating systems and their applications to communicate by any desired
protocol, from raw data transfer to a full IP protocol stack.

The hardware resource dispatcher virtual bus is based on shared memory and
system cross interrupts principles already discussed above. In detail, the virtual bus
routine 418 emulates the C5 buscom DDI: syscom which defines virtual bus bridge
shared devices, allowing the export (sharing) of memory across the virtual bus and
triggering of cross-interrupts into other operating systems.

Each virtual bus driver, in each secondary operating system, creates such a
virtual bus bridge in the hardware resource dispatcher hardware repository at startup
time. By doing so, it exports (shares) a region of its private memory, and provides a
way to raise interrupts within its hosting system.

Thus, a virtual bus driver of a first operating system sends data to a second
operating system by:

• writing into the memory exported by a peer virtual bus driver of the second
operating system, and then
• triggering a cross-interrupt to notify that data are available to the peer bus
driver in the second operating system.

In the reverse (incoming) direction, the virtual bus driver propagates incoming data
up-stream (for use by the application or routine for which it is intended) when
receiving a cross-interrupt indicating that such data have been stored in its own
exported memory region.
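
The two-step send (write into the peer's exported memory, then raise a cross-interrupt) and the corresponding receive can be modelled as below. The class `VirtualBusDriver` and its attribute names are hypothetical, and the exported memory region is represented by a plain list purely for illustration.

```python
class VirtualBusDriver:
    """One operating system's end of the virtual bus (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.exported = []  # memory region exported to peer drivers
        self.received = []  # data propagated up-stream

    def cross_interrupt(self):
        # Handler run in the receiving OS: drain the exported region and
        # pass the data up-stream in the order it was written.
        while self.exported:
            self.received.append(self.exported.pop(0))

    def send(self, peer, data):
        # 1. Write into the memory exported by the peer driver.
        peer.exported.append(data)
        # 2. Trigger a cross-interrupt to tell the peer data is available.
        peer.cross_interrupt()
```

In this sketch the cross-interrupt is an ordinary method call; in the embodiment it would be posted as a virtualised exception and serviced when the receiving operating system next runs.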

Referring to Figure 9a, an application 208a which is to communicate with
another 208b running on the same operating system 202 can do so through that
operating system. An application 207b running on one operating system 201 which is
to communicate with another 208b running on a different operating system 202 does so
by writing data to the virtual bus using the API of its operating system, which uses the
virtual bus driver routine to pass the data to the other operating system 202, which
propagates it from its virtual bus driver to the application 208b.

Referring to Figure 9b, the changes necessary to migrate this arrangement to
one in which the first and second operating systems run on different computers 100,
101 are small; it is merely necessary to change the drivers used by the operating
systems, so that they use drivers for a real bus 103 rather than the virtual bus drivers.
The system is therefore made more independent of the hardware on which it operates.
Communication across the hardware resource dispatcher virtual bus is available
to applications, but can also be used internally by the operating system kernels, so that
they can cooperate in the implementation of services distributed among multiple
operating systems. "Smart" distributed services of this kind include the software watchdog
used for system hot restart (discussed above), or a distributed network protocol stack.

Debugging 

In a preferred embodiment, the hardware resource dispatcher has a second 
mode of operation, in which it acts as a debugging agent. 

According to this embodiment, in the second mode, the hardware resource 
dispatcher can communicate via a serial communications line with debugging software 
tools running on another machine (the "host" machine).

Such debugging tools provide a high level graphical user interface (GUI) to
remotely control the hardware resource dispatcher. The hardware resource dispatcher
virtualised exception mechanism is used to intercept defined exceptions. The user can
then configure and control how the hardware resource dispatcher behaves in case of 
processor exceptions, and also display machine and system states, to enable diagnosis 
of code or other system errors or problems. 

The user can select one or more such processor exceptions as the basis for a 
trap call from an operating system to the hardware resource dispatcher. On the basis of 
the selected exception, when the or each exception occurs during execution, the 
operating system is stopped, and executes the trap call to the hardware resource 
dispatcher, which then saves the current context and enables interaction with the 
debugging tools on the host. The user can then cause the display of the current states 
of the state variables (such as the stack pointers, program and address counters) and/or 
the content of a selected block of memory. The user can specify either that a given type
of exception should be trapped in a specific operating system to be debugged, or that
they should be trapped whenever they occur, in any operating system. In response, the 
trap call is implemented in just one, or in all, operating systems. The user can also 
specify if a given type of exception is to be normally forwarded to the system when 
restarting execution or simply ignored. 
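
The choice between trapping an exception in one specific operating system or in all of them can be expressed as a simple lookup. The configuration layout, the `ALL_SYSTEMS` marker and the function name `should_trap` are assumptions made for this sketch and do not describe the actual debug agent interface.

```python
ALL_SYSTEMS = "*"  # hypothetical marker meaning "trap in any operating system"


def should_trap(trap_config, os_name, exception):
    """Decide whether an exception raised in os_name goes to the debug agent.

    trap_config: dict exception name -> set of OS names, or ALL_SYSTEMS.
    """
    target = trap_config.get(exception)
    if target is None:
        return False  # exception type not selected by the user
    return target == ALL_SYSTEMS or os_name in target
```

An exception type that is absent from the configuration is simply not trapped, which corresponds to the case where the user has not selected it.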

Because the hardware resource dispatcher executes in its own environment, it is
able to debug much more of an operating system than could be done from within that
system. Importantly, no code is shared between the hardware resource dispatcher
acting as a debug agent and the systems being debugged. This allows, for example, the
debugging of even kernel low level code such as exception vectors or interrupt service
routines.

Some other aspects of the overall (host/target) debugging architecture
according to this embodiment are similar to those for the Chorus and C5 debugging
systems, described in the document "C5 1.0 Debugging Guide" published by Jaluna,
and available at:

http://www.jaluna.com/doc/c5/html/DebugGuide/book1.html

Secure Architecture

It will be clear that the embodiments described above give a firm basis for a
secure architecture. This is because the secondary operating system, on which a user
will typically run insecure applications, is insulated from specified system resources,
and accesses them only through the hardware resource dispatcher (and the drivers of
the primary operating system). Thus, security applications can be run on the primary
operating system which, for example, perform encryption/decryption; allow access to
encrypted files; manage, store and supply passwords and other access information;
manage and log access and reproduction of copyright material. Applications running
on the secondary operating system cannot access system resources which are not
allocated to that operating system, and where the operating systems run in different
memory contexts (i.e. use different addressing pointers to different spaces)
applications running on the secondary operating system cannot be used to interfere
with those operating on the primary system so as to weaken the security of its
operations.

Intel Architecture embodiment features 

In the following, the hardware resource dispatcher is described (in a non-limiting
sense) as a nanokernel. This section focuses on IA-32 Intel specific aspects of
the nanokernel implementation, in particular, on the nanokernel executive which is the
corner stone of the nanokernel environment.

It describes how the IA-32 Intel processor architecture is used in order to
implement the nanokernel executive which is capable of running multiple independent
operating systems concurrently sharing the central and floating-point processor units
(CPU and FPU) as well as the memory management unit (MMU) across these
operating systems.

It also describes how the nanokernel executive handles the hardware interrupts.
In particular, it describes the mechanism used to intercept and forward hardware
interrupts toward the primary operating system and the software interrupts mechanism
provided to the secondary operating systems.

Note that we assume that the nanokernel is running on a uniprocessor computer
and therefore aspects related to the symmetrical multi-processor (SMP) architecture are
not addressed here.

Overview 

Virtual Address Spaces 

On IA-32 Intel architecture the nanokernel always runs in a virtual address space;
in other words, the MMU is always enabled. On the other hand, the memory context in
which the nanokernel code is executing may vary in time.

In this description the term memory context designates an IA-32 address
translation tree whose root directory table is specified by the CR3 register.

Typically, an operating system supporting user mode processes creates multiple
memory contexts (one per user process) in order to be able to handle private user
virtual address spaces. The kernel changes the memory context each time it switches
from one user process to another. On the other hand, together with the user address
spaces, the operating system kernel also handles the unique supervisor address space
replicated in all memory contexts. User and supervisor virtual addresses never overlap
on IA-32 Intel architecture.

The supervisor address space mappings may be either static or dynamic. The
static mapping is created at system initialization time and it typically maps (entirely or
partially) available physical memory. Such a mapping is also called the one-to-one or
kernel virtual (KV) mapping. In particular, the KV mapping usually covers the kernel
code, data and bss sections. Dynamic mappings are created at run time in order to
access dynamically loaded kernel modules or dynamically allocated (non contiguous)
memory chunks.

Three kinds of memory context are distinguished in the nanokernel
environment: primary, secondary and nanokernel.

The primary memory context is a memory context currently used by the primary
kernel. Note that, if the primary operating system supports user address spaces,
there might be multiple memory contexts used by the primary kernel but, as
already mentioned above, the supervisor address space is identical in all such
contexts. Because the nanokernel does not care about user mappings, the primary
memory context is unique from the nanokernel perspective and it consists of the
static and dynamic supervisor mappings established by the primary kernel.

The secondary memory context is a memory context currently used by the
secondary kernel. Once more, if the secondary operating system supports user
address spaces, there might be multiple memory contexts used by the secondary
kernel, but the supervisor address space is still identical in all such contexts.
Because the nanokernel is only aware of the static KV mapping established by the
secondary kernel, the secondary memory context is unique from the nanokernel
perspective (for a given secondary kernel) and it consists of that one-to-one
mapping. It is important to note that the nanokernel requires the secondary
kernel data it uses to be accessible through the static KV mapping. Such data
structures are listed in further sections describing the nanokernel interface to
the secondary kernel.

The nanokernel memory context is built by the nanokernel itself. This context
reproduces all KV mappings established by the primary as well as by all secondary
kernels. In order to be able to create such a memory context, the nanokernel
requires all KV mappings to be compatible. Two KV mappings are compatible if and
only if they either do not overlap or are identical.
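
The compatibility rule can be illustrated with a minimal sketch in C. The
kv_mapping structure and the kv_compatible function are hypothetical names
introduced here for illustration only; they are not part of the nanokernel
interface. The sketch assumes that a KV mapping is fully described by a virtual
base, a physical base and a size, and that "overlap" refers to the virtual
address ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a static one-to-one (KV) mapping:
 * `size` bytes of physical memory starting at `phys` appear at `virt`. */
typedef struct {
    uint32_t virt;
    uint32_t phys;
    uint32_t size;
} kv_mapping;

/* Two KV mappings are compatible if and only if their virtual ranges
 * do not overlap, or the mappings are identical. */
static bool kv_compatible(const kv_mapping *a, const kv_mapping *b)
{
    assert(a != NULL && b != NULL);

    /* 64-bit end addresses avoid wrap-around near the top of the 4 GB space. */
    uint64_t a_end = (uint64_t)a->virt + a->size;
    uint64_t b_end = (uint64_t)b->virt + b->size;

    bool overlap = a->virt < b_end && b->virt < a_end;
    bool identical = a->virt == b->virt && a->phys == b->phys &&
                     a->size == b->size;

    return !overlap || identical;
}
```

With these assumptions, a 128 megabyte mapping at virtual address zero and a
128 megabyte mapping at 0xc0000000 are reported as compatible, whereas two
different mappings whose virtual ranges intersect are not.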

Note that when running multiple identical secondary kernels, their KV mappings
are naturally identical, and therefore compatible. A problem may however occur
when the primary operating system is different from the secondary one. In this
case, it might be necessary to modify one of the systems in order to make the KV
mappings compatible.

The nanokernel memory context is mainly used to execute the nanokernel code
when a secondary kernel is preempted by an interrupt, trap or exception event
handled by the nanokernel, for example, in order to perform an I/O operation on
the nanokernel console. The nanokernel memory context is also used as an
intermediate address space which allows switching from a secondary execution
environment to the primary one and vice versa.

The nanokernel binary resides in the primary KV mapping and the nanokernel
code executes either in the primary or in the nanokernel memory context. In other
words, the nanokernel code executes in place in the one-to-one mapping defined by
the primary kernel. When the nanokernel preempts the primary kernel, the
nanokernel code executes in the primary memory context. When the nanokernel
preempts a secondary kernel, the nanokernel code executes in the nanokernel
context, which replicates the primary KV mapping. Note that in general there are
no restrictions on the primary data used by the nanokernel, because the
nanokernel operations called by the primary kernel are executed in the primary
memory context and therefore the primary supervisor address space is directly
accessible. On the other hand, the nanokernel requires some primary kernel data,
used by the nanokernel during the switch to/from a secondary kernel, to be
accessible through the static KV mapping. Such data structures are listed in
further sections describing the nanokernel interface to the primary kernel.

Figure 10 shows an example of the primary, secondary and nanokernel virtual
address spaces.

In this example the physical memory size is 128 megabytes. The primary kernel
uses the trivial one-to-one (KV) mapping starting from zero (like the C5
microkernel) and the secondary kernel uses a shifted one-to-one (KV) mapping
starting from 0xc0000000 (like the Linux kernel). These KV mappings are
compatible and the nanokernel address space maps the physical memory twice,
reproducing both one-to-one mappings.

Figure 11 shows how the memory context is switched over time. Initially, a
secondary operating system is running in a secondary memory context. At time t0,
the current secondary kernel traps to the nanokernel in order to output a
character to the nanokernel console. This trap switches the current memory
context to the nanokernel one. During the [t0, t1] period, the nanokernel
(running in the nanokernel memory context) prints out a character to the
nanokernel console. At time t1, the nanokernel returns to the secondary kernel,
switching back to the secondary memory context. At time t2, an interrupt occurs
while the secondary operating system is running. The interrupt switches the
current memory context to the nanokernel one and invokes the nanokernel interrupt
handler. In order to forward the interrupt to the primary kernel, the nanokernel
switches from the nanokernel memory context to the primary one and invokes the
primary interrupt handler at time t3. During the interrupt request processing, at
time t4, the primary kernel invokes the nanokernel write method in order to
output a message on the nanokernel console. Note that this is a simple indirect
call which does not switch the memory context; the write operation is entirely
executed in the primary memory context.

At time t5, the nanokernel returns from the write method to the primary kernel,
which continues the interrupt request processing until time t6. At this moment,
the primary kernel returns from the interrupt handler and the nanokernel switches
back to the interrupted secondary operating system in order to continue its
execution. Such a switch starts in the primary memory context and, going through
the intermediate nanokernel context, finally ends up in the secondary memory
context at time t7.

Nanokernel Invocation and Preemption 

The nanokernel is invoked either explicitly through a function call/trap or
implicitly through an interrupt/exception handler. In the former case, we say
that an operating system kernel invokes the nanokernel. In the latter case, we
say that the nanokernel preempts an operating system. It is important to
underline that the nanokernel is always invoked from privileged code running in
the supervisor address space. On the other hand, the nanokernel may preempt the
kernel itself as well as a user process running under kernel control.

Once the system is booted, the nanokernel is activated first and it starts
execution of the primary and secondary kernels. Once the initialization phase is
done, the nanokernel plays a passive role. This means that the code executed in
the nanokernel is driven by the primary and secondary kernels explicitly invoking
the nanokernel (by call or trap) or by externally generated synchronous (i.e.,
exceptions) and asynchronous (i.e., interrupts) events.

On the IA-32 Intel architecture, the mechanisms used for the nanokernel
invocation and preemption are different for primary and secondary operating
systems. In terms of execution environment, the nanokernel is quite close to the
primary kernel. It uses the same memory context and, sometimes, the same
supervisor stack. Thus, the nanokernel has roughly the same availability as the
primary kernel. On the other hand, there is a barrier between the secondary
operating systems and the nanokernel, providing some protection against secondary
kernel malfunction. Note however that such protection is not absolute and a
secondary kernel is still able to crash the primary kernel as well as the
nanokernel.

Primary Invocation 

The primary kernel invokes the nanokernel by a simple indirect call. The
memory context is not switched by the invocation.

Primary Preemption 

The nanokernel preempts the primary operating system through an interrupt
gate. The memory context is not switched by preemption and the native primary
supervisor stack is used to handle the preemption.

The nanokernel preempts the primary operating system only in rare cases. One
of them is the device not available exception (#NM) used by the nanokernel to
handle the FPU sharing between kernels in a lazy fashion as described further in
this document.

Secondary Invocation

A secondary kernel invokes the nanokernel by a trap. The nanokernel intercepts
such a trap by a task gate which switches to the nanokernel memory context and
starts the trap task execution.

Secondary Preemption 

The nanokernel preemption of a secondary operating system is similar to the
invocation mechanism and is based on task gates. When a secondary system is
preempted by an interrupt or exception, the corresponding task gate switches to
the nanokernel memory context and starts execution of the corresponding
nanokernel task.

Kernel Context 

The nanokernel data can be split into two categories: global and per-kernel
data. The global data keeps the global nanokernel state (e.g., the nanokernel
memory context) while the per-kernel data keeps the state associated with a given
primary or secondary kernel. The per-kernel data is also called the kernel
context.

The kernel context consists of two parts: visible and hidden. The visible part
is public and is part of the nanokernel interface. This part of the kernel
context is described in detail in further sections related to the nanokernel
interface. The hidden part is not visible to kernels and is used internally by
the nanokernel executive.

Nanokernel Executive Interface

This chapter describes the nanokernel executive interface exported to the
primary and secondary kernels. Such an interface consists of data shared between
a kernel and the nanokernel (i.e., the visible kernel context) as well as the
nanokernel methods. Note that the nanokernel interface is kernel role specific
and is (strictly speaking) different for the primary and secondary kernels. On
the other hand, there is a quite significant intersection between these two
interfaces which can be described independently of the kernel role.



Visible Kernel Context

Figure 12 illustrates the visible part of the kernel context.

All kernel contexts (primary as well as secondary) are linked in a circular
list. The next field refers to the next kernel context within such a list. Note
that, in the visible part of the kernel context, all references are made through
physical addresses. A kernel has to convert such a physical address to the
virtual one (from the KV mapping) in order to access the referenced object. The
figure shows a configuration with only two kernels: primary and secondary. The
primary context points to the secondary one which, in turn, points back to the
primary context.

The pending VEX and enabled VEX fields reflect the current state of the virtual
exceptions. Note that these fields are meaningless for the primary context
because the primary kernel exceptions are not virtualized by the nanokernel. The
virtualized exceptions mechanism is described in detail further in this document
together with the secondary kernel execution model.

The boot info field points to the boot parameters given by the BIOS. This field
is read-only.

Note that such a data structure is kernel specific and therefore it is also
located in the kernel context. Among other fields, the boot parameters structure
points to the boot command line specifying the boot time parameters. Such
parameters are either given to the boot loader (e.g., the GRUB boot loader) or
passed through the nanokernel environment. The command line is kernel specific
and it is located in the kernel context as well. The nanokernel parses the
initial command line in order to create kernel specific command lines containing
only parameters related to the corresponding kernel.

The RAM info field points to the RAM description table. This field is read-only.
The RAM description table is a global data structure shared by all kernels. It
describes how the RAM resource is distributed across the kernels.

The dev info field points to the list of virtual devices abstracted by the
nanokernel. This field is read-only for a secondary kernel and read-write for the
primary one. The devices list is global and it is shared by all kernels. Each
virtual device in the list is represented by a data structure specified by the
nanokernel. This data structure is typically accessed by both primary and
secondary peer drivers using rules defined by the nanokernel. The primary peer
driver plays a server role supporting the virtual device while the secondary peer
driver plays a client role using the virtual device instead of the real one. This
list is created (and modified) by the primary kernel only. A secondary kernel is
only allowed to browse this list.

The pending XIRQ field specifies pending cross interrupts. This field is not
used by the nanokernel itself. It is hosted by the context structure in order to
assist the primary and secondary kernels in the cross interrupts exchange. There
is only one exception dedicated to the cross interrupt delivery. The pending XIRQ
field allows the number of cross interrupts to be extended up to 32 (one bit per
cross interrupt source). A cross interrupt bit is set by the source kernel (i.e.,
the kernel which sends the cross interrupt) and it is reset by the destination
kernel (i.e., the kernel which receives the cross interrupt).
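
The bit-per-source convention can be sketched as follows. This is an
illustrative sketch only: the structure kernel_ctx and the functions xirq_post
and xirq_take are hypothetical names, and the sketch deliberately omits the
step that raises the dedicated cross interrupt exception, which is not detailed
here. A real implementation would also need the set and reset operations to be
atomic with respect to preemption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XIRQ_MAX 32u  /* one bit per cross interrupt source */

/* Hypothetical slice of the visible kernel context. */
typedef struct {
    uint32_t pending_xirq;
} kernel_ctx;

/* Source side: mark cross interrupt `src` as pending in the destination
 * kernel context. */
static void xirq_post(kernel_ctx *dst, unsigned src)
{
    assert(dst != NULL && src < XIRQ_MAX);
    dst->pending_xirq |= (uint32_t)1 << src;
}

/* Destination side: find the lowest-numbered pending cross interrupt,
 * reset its bit and return its number, or -1 if nothing is pending. */
static int xirq_take(kernel_ctx *self)
{
    assert(self != NULL);
    for (unsigned i = 0; i < XIRQ_MAX; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if (self->pending_xirq & bit) {
            self->pending_xirq &= ~bit;
            return (int)i;
        }
    }
    return -1;
}
```

Servicing the lowest-numbered source first is an arbitrary choice made for the
sketch; the text does not prescribe any ordering among pending cross interrupts.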

The ID field contains a unique kernel identifier. This field is read-only.
Identifier 0 is assigned to the nanokernel itself and identifier 1 is assigned to
the primary kernel. The kernel identifier designates the kernel in the nanokernel
interface. For example, the kernel identifier is used to tag resources assigned
to a given kernel (e.g., memory chunks in the RAM description table).

The running field is a flag specifying the kernel state: running or halted. This
field is read-only. The nanokernel sets this flag before launching the kernel and
clears it once the kernel is halted. When a kernel is restarted, the running flag
is first cleared and then set. Any kernel is able to browse the circular list of
kernel contexts and to analyze the running flag in order to find out all running
peer kernels. Note that the running flag is always set for the primary kernel.
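
As an illustration of how a kernel might browse the circular list, the following
sketch walks the list through physical next references and counts the running
peers. The field layout of kernel_ctx, the helper ctx_at and the parameter
kv_base (the virtual address at which physical address zero appears in the
caller's KV mapping) are assumptions made for this sketch and do not reproduce
the actual layout shown in Figure 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the visible kernel context. */
typedef struct {
    uint32_t next;     /* physical address of the next kernel context */
    uint32_t id;       /* unique kernel identifier */
    uint32_t running;  /* non-zero while the kernel is running */
} kernel_ctx;

/* Convert a physical address to a pointer through the KV mapping. */
static const kernel_ctx *ctx_at(const unsigned char *kv_base, uint32_t phys)
{
    assert(kv_base != NULL);
    return (const kernel_ctx *)(const void *)(kv_base + phys);
}

/* Count the running peers of the kernel whose context lives at physical
 * address `self_phys`, following the circular list until it wraps around. */
static size_t count_running_peers(const unsigned char *kv_base,
                                  uint32_t self_phys)
{
    size_t n = 0;
    uint32_t p = ctx_at(kv_base, self_phys)->next;

    while (p != self_phys) {
        const kernel_ctx *c = ctx_at(kv_base, p);
        if (c->running)
            n++;
        p = c->next;
    }
    return n;
}
```

The walk terminates because the list is circular: following next from any
context eventually leads back to the starting context.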

The final part of the visible kernel context is role specific.

The primary context specifies addresses of the nanokernel interface methods.
The primary kernel uses these addresses in order to invoke the nanokernel through
an indirect function call. The method addresses are set up by the nanokernel and
they must not be modified by the primary kernel. The nanokernel interface methods
are described in detail in the next section.

The secondary kernel uses the trap mechanism to invoke the nanokernel rather
than an indirect call. So, addresses of the nanokernel interface methods are not
present in the secondary context. Instead, the secondary context has a secondary
TS bit field which keeps the TS bit state of the CR0 register. Such a software
image of the TS bit may be used by the secondary kernel in order to manage the
FPU resource in a lazy way as described in detail further in this document.

Nanokernel Methods

The nanokernel provides two groups of methods: the console I/O operations and
the executive operations. The console I/O group allows a kernel to send/receive
characters to/from the nanokernel console serial line. This document does not
specifically address the console I/O methods, which are more or less generic;
rather, it focuses on the executive methods, which are IA-32 Intel architecture
specific.

Basically, the nanokernel environment replaces some IA-32 Intel processor
instructions with the nanokernel methods. Such substituted instructions typically
load or store some IA-32 Intel processor registers:

• Global Descriptor Table Register (GDTR)
• Interrupt Descriptor Table Register (IDTR)
• Task Register (TR)

Load/Store GDT Register (LGDT/SGDT) 

Instead of loading/storing directly to/from the GDT register via the IA-32
lgdt/sgdt instructions, in the nanokernel environment, a kernel has to invoke the
lgdt/sgdt nanokernel methods to do so. These methods are similar for the primary
and secondary kernels except that they are indirect calls for the primary kernel
and traps for the secondary ones.

Similar to the processor instructions, the lgdt/sgdt nanokernel methods take
only one parameter specifying a 6-byte memory location that contains the native
table base address (a virtual address) and the native table limit (size of table
in bytes). It is important to underline that the native GDT must always be
located within the KV mapping (even for the primary kernel).

The nanokernel manages a per-kernel global descriptor table. This (real) table
resides in the hidden part of the kernel context shown in Figure 13. Together
with the real GDT, the nanokernel keeps a pointer to the native GDT which is
given to the nanokernel via the lgdt method. The nanokernel initializes the real
GDT from the native one by copying the segment descriptors. Note however that a
part of the real GDT is reserved for the nanokernel segments. The nanokernel
maintains a bit string specifying a mapping between the native and real tables.
For each entry in the real table, the mapping specifies whether the entry is used
for a nanokernel segment, or is inherited from the native table and therefore
contains a copy of the corresponding kernel segment. The real entries used for
the nanokernel segments are not updated by the lgdt method.
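
The copy step can be sketched as follows, under stated assumptions. The names
hidden_gdt, reserve_entry and gdt_load are hypothetical, descriptors are treated
as opaque 8-byte values, and the bit string is assumed to hold one bit per real
entry (set when the entry belongs to the nanokernel). Rejecting a native table
larger than the real one is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REAL_GDT_ENTRIES 256u  /* default real table size given in the text */

/* Hypothetical hidden-context state for the real GDT. */
typedef struct {
    uint64_t real[REAL_GDT_ENTRIES];          /* 8-byte segment descriptors */
    uint8_t  reserved[REAL_GDT_ENTRIES / 8];  /* bit set: nanokernel segment */
} hidden_gdt;

static int is_reserved(const hidden_gdt *g, size_t i)
{
    return (g->reserved[i / 8] >> (i % 8)) & 1;
}

/* Install a nanokernel segment and mark its entry as reserved. */
static void reserve_entry(hidden_gdt *g, size_t i, uint64_t desc)
{
    assert(g != NULL && i < REAL_GDT_ENTRIES);
    g->real[i] = desc;
    g->reserved[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Copy native descriptors into the real table, skipping reserved entries.
 * Returns the number of entries copied, or -1 if the native table does not
 * fit into the real one. */
static int gdt_load(hidden_gdt *g, const uint64_t *native, size_t native_entries)
{
    assert(g != NULL && native != NULL);
    if (native_entries > REAL_GDT_ENTRIES)
        return -1;

    int copied = 0;
    for (size_t i = 0; i < native_entries; i++) {
        if (!is_reserved(g, i)) {
            g->real[i] = native[i];
            copied++;
        }
    }
    return copied;
}
```

The same pattern would apply to a table of gate descriptors, with reserved bits
marking the gates installed by the nanokernel.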

The nanokernel segments are located at the end of the real table, whose default
size is 256 entries. When porting a kernel to the nanokernel architecture, an
overlap between the kernel and nanokernel segments should be avoided by either
re-arranging kernel segments within the native table (moving them to the
beginning of the table) or increasing the real GDT size.

Load/Store IDT Register (LIDT/SIDT) 

Instead of loading/storing directly to/from the IDT register via the IA-32
lidt/sidt instructions, in the nanokernel environment, a kernel has to invoke the
lidt/sidt nanokernel methods to do so. These methods are similar for the primary
and secondary kernels except that they are indirect calls for the primary kernel
and traps for the secondary ones.

Similar to the processor instructions, the lidt/sidt nanokernel methods take
only one parameter specifying a 6-byte memory location that contains the native
table base address (a virtual address) and the native table limit (size of table
in bytes). It is important to underline that the native IDT must always be
located within the KV mapping (even for the primary kernel).
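
For illustration, the 6-byte operand can be decoded as in the sketch below. The
layout used is the one defined by the IA-32 architecture for the lgdt/lidt
instructions: a 16-bit limit in the two low bytes followed by a 32-bit base
address, both little-endian. On IA-32 hardware the limit field holds the table
size in bytes minus one; the sketch assumes that the nanokernel methods follow
the same convention. The names table_reg, decode_table_operand and entry_count
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded form of the 6-byte table register operand. */
typedef struct {
    uint16_t limit;  /* table size in bytes minus one (IA-32 convention) */
    uint32_t base;   /* virtual base address of the native table */
} table_reg;

/* Decode the little-endian operand: bytes 0-1 limit, bytes 2-5 base. */
static table_reg decode_table_operand(const uint8_t op[6])
{
    assert(op != NULL);

    table_reg r;
    r.limit = (uint16_t)((unsigned)op[0] | ((unsigned)op[1] << 8));
    r.base  = (uint32_t)op[2] |
              ((uint32_t)op[3] << 8) |
              ((uint32_t)op[4] << 16) |
              ((uint32_t)op[5] << 24);
    return r;
}

/* Number of 8-byte descriptors covered by the limit. */
static unsigned entry_count(table_reg r)
{
    return ((unsigned)r.limit + 1u) / 8u;
}
```

Decoding byte by byte, rather than casting the operand to a structure, avoids
any dependence on compiler padding between the 16-bit and 32-bit fields.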

The nanokernel manages a per-kernel interrupt descriptor table. This (real)
table resides in the hidden part of the kernel context shown in Figure 13.
Together with the real IDT, the nanokernel keeps a pointer to the native IDT
which is given to the nanokernel via the lidt method. The nanokernel initializes
the real IDT from the native one by copying the gate descriptors.

Note that the nanokernel can install its own gate in the real table in order to
intercept an exception. For example, the nanokernel intercepts the device not
available exception (#NM) in order to manage the FPU sharing in a lazy fashion.
So, similar to the GDT, the nanokernel maintains a bit string specifying a
mapping between the native and real tables. For each entry in the real table, the
mapping specifies whether the entry is installed with a nanokernel gate, or is
inherited from the native table and therefore contains a copy of the
corresponding kernel gate. The real entries installed with the nanokernel gates
are not updated by the lidt method.

The real table size must be equal to or greater than the native table size. If
this requirement is not met when porting a kernel to the nanokernel architecture,
either the real table size has to be increased or the native table size has to be
reduced.

Load Task Register (LTR)

Instead of loading directly to the task register via the IA-32 ltr instruction,
in the nanokernel environment, a kernel has to invoke the ltr nanokernel method
to do so. This method is similar for the primary and secondary kernels except
that it is an indirect call for the primary kernel and a trap for the secondary
ones.

Similar to the processor instruction, the ltr nanokernel method takes only one
parameter specifying a segment selector that points to a task state segment
(TSS). It is important to underline that the TSS pointed to by the segment
selector must always be located within the KV mapping (even for the primary
kernel).

Idle 

The nanokernel provides an idle method which has to be called by a kernel
within an idle loop. The idle method is equivalent to the IA-32 Intel hlt
instruction and it informs the nanokernel that the calling kernel has nothing to
do until the next interrupt. This method is similar for the primary and secondary
kernels except that it is an indirect call for the primary kernel and a trap for
the secondary ones.

The idle method invocation results in a system switch to the next ready to run
secondary kernel (if any) or in the return from the primary idle method when all
secondary kernels are idle. The idle method has no parameter.

The primary idle method should be called with enabled processor interrupts and
it always returns to the caller with disabled processor interrupts. So, once
returned from the nanokernel idle method, the primary kernel is able to directly
execute the IA-32 sti instruction followed by the IA-32 hlt instruction in order
to suspend the processor until the next interrupt.

The secondary idle trap can be called with either enabled or disabled (software)
interrupts and it always returns to the caller with enabled interrupts. In fact,
the secondary idle trap implicitly enables interrupts and it returns to the
caller once an interrupt has been delivered to this kernel as a virtual exception
(VEX).

Restart 

The nanokernel provides a restart method which can be called by the primary as
well as by a secondary kernel in order to restart a secondary kernel. This method
is similar for the primary and secondary kernels except that it is an indirect
call for the primary kernel and a trap for the secondary ones.

The method parameter specifies the identifier of the kernel being restarted. The
nanokernel stops the kernel execution, restores the kernel image from its copy
and finally starts the kernel execution at the initial entry point.
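
The three restart steps can be summarized by the following sketch. It is a
simplified model, not the nanokernel implementation: the kernel_state structure
and the nk_restart function are hypothetical, the saved program counter stands
in for the full processor state, and the running flag models the flag kept in
the kernel context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-kernel state kept by the nanokernel. */
typedef struct {
    uint8_t       *image;    /* live memory banks of the kernel */
    const uint8_t *copy;     /* pristine copy saved at initialization */
    size_t         size;     /* size of the image in bytes */
    uint32_t       entry;    /* initial entry point */
    uint32_t       pc;       /* where execution resumes */
    int            running;  /* running flag of the kernel context */
} kernel_state;

static void nk_restart(kernel_state *k)
{
    assert(k != NULL && k->image != NULL && k->copy != NULL);

    k->running = 0;                       /* 1. stop the kernel execution */
    memcpy(k->image, k->copy, k->size);   /* 2. restore the image from its copy */
    k->pc = k->entry;                     /* 3. restart at the initial entry point */
    k->running = 1;
}
```

Clearing the flag before the copy and setting it afterwards mirrors the
statement that the running flag is first cleared and then set when a kernel is
restarted.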

Secondary Reboot 

The reboot trap is provided by the nanokernel to a secondary kernel. Such a
trap is called by a secondary kernel when it is rebooting. This trap is
equivalent to the restart trap called on the kernel itself.

Secondary Halt 

The halt trap is provided by the nanokernel to a secondary kernel. Such a trap
is called by a secondary kernel when it is halted. The nanokernel puts the caller
kernel into a non-running state in order to prevent this kernel from being
switched in by the nanokernel scheduler.

A stopped kernel can be started again by the restart nanokernel method
described above.

Primary Execution Environment 

Basically, the primary kernel executes in the native execution environment.
The nanokernel implementation on the IA-32 Intel processor tries to minimize the
impact of the nanokernel environment on the primary operating system
characteristics (performance, interrupt latency, preemption latency). Because the
primary operating system is typically a real-time operating system, it is
important to keep the primary kernel behavior unchanged even if other (secondary)
operating systems are running concurrently on the same processor.

Initialization 

The nanokernel is started first by the boot loader with the MMU disabled, i.e.,
in the physical space. Basically, the nanokernel initialization code installs the
primary memory bank (containing the primary kernel code/data/bss sections) in the
physical memory and jumps to the primary entry point.

Before jumping to the primary kernel, the nanokernel initializes the primary
kernel context and, in particular, the real GDT and IDT to an initial state.

The initial primary GDT has only two valid entries specifying the nanokernel
code and data segments. The selectors used for the kernel code and data segments
are fixed by the nanokernel interface to 0x10 and 0x18 respectively. So, when
porting a kernel to the IA-32 nanokernel architecture, the above code and data
selectors have to be used.
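
As a worked illustration of these two values, the sketch below splits an IA-32
segment selector into its architectural fields: bits 15..3 hold the descriptor
index, bit 2 the table indicator (0 for the GDT) and bits 1..0 the requested
privilege level. The macro and function names are hypothetical. With this
layout, 0x10 designates GDT entry 2 and 0x18 designates GDT entry 3, both at
privilege level 0.

```c
#include <assert.h>
#include <stdint.h>

/* Selector values fixed by the nanokernel interface (from the text). */
#define NK_CODE_SELECTOR 0x10u
#define NK_DATA_SELECTOR 0x18u

/* Descriptor table index: bits 15..3 of the selector. */
static unsigned sel_index(uint16_t sel) { return (unsigned)sel >> 3; }

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
static unsigned sel_ti(uint16_t sel) { return ((unsigned)sel >> 2) & 1u; }

/* Requested privilege level: bits 1..0. */
static unsigned sel_rpl(uint16_t sel) { return (unsigned)sel & 3u; }

/* Sanity check of the fixed selectors against the field layout. */
static void check_fixed_selectors(void)
{
    assert(sel_index(NK_CODE_SELECTOR) == 2 && sel_ti(NK_CODE_SELECTOR) == 0);
    assert(sel_index(NK_DATA_SELECTOR) == 3 && sel_ti(NK_DATA_SELECTOR) == 0);
    assert(sel_rpl(NK_CODE_SELECTOR) == 0 && sel_rpl(NK_DATA_SELECTOR) == 0);
}
```

A ported kernel whose native GDT places its own code and data descriptors at
other indices would therefore have to move them to entries 2 and 3.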

All gates in the initial primary IDT as well as the task register are invalid 
(zeroed). 

The nanokernel initialization code is executed using a static nanokernel stack
located in the data section. When jumping to the primary kernel, this stack is
still valid. Despite that, the primary kernel should switch to its own stack as
soon as possible and should never use this nanokernel stack in the future. The
nanokernel stack is used not only during the initialization phase but also at run
time in order to handle secondary invocations and preemptions as described in the
next chapter.

When jumping to the primary kernel, the %esi register points to the kernel
context and the eflags register is cleared. So, processor interrupts are disabled
at the beginning of the primary initialization phase. The primary kernel usually
enables interrupts once a critical initialization phase is done.

During the initialization phase, the primary kernel typically invokes the
nanokernel methods in order to set up the GDT, IDT and task registers. Finally
the primary kernel enters the idle loop and invokes the nanokernel idle method.

When the idle method is called for the first time, the nanokernel considers
that the primary kernel has fully initialized its execution environment and it
proceeds to the post initialization phase.

In such a post initialization phase, the nanokernel builds the nanokernel
memory context and initializes the secondary kernel contexts as described in the
next chapter. Note that the nanokernel memory context creation is deferred until
the post initialization phase because it requires allocation of physical memory
for building the translation tree, and the available memory resource is
discovered and registered by the primary kernel initialization code. Once the
post initialization is done, the nanokernel calls the scheduler in order to
either switch to a ready to run secondary kernel or return from the primary idle
method if all secondary kernels are idle.

The nanokernel requires the primary kernel to initialize the globally shared
data structures: the RAM descriptor and the virtual devices list. Such an
initialization has to be done before the idle method is called. This requirement
is natural because beyond this moment a secondary kernel can access the globally
shared data structures.

In particular, the primary kernel is in charge of detecting the physical memory
available on the board and of registering free physical memory chunks in the RAM
descriptor.

According to the primary Board Support Package (BSP), the primary kernel
should start nanokernel aware drivers which, in turn, should populate the virtual
devices list. Such virtual devices are provided to secondary kernels and
therefore they should be created before the first secondary kernel is started.

Intercepted Exceptions 

Basically, the nanokernel does not intercept exceptions which occur when the
primary operating system is running on the processor. All programming exceptions,
traps and interrupts are handled by native primary handlers. The primary
low-level handlers do not need to be modified when porting to the IA-32 Intel
nanokernel architecture.

The exceptions to the above rule are the programming exceptions related to FPU
emulation:

• the invalid opcode exception (#UD)
• the device not available exception (#NM)

The FPU emulation feature is used by the nanokernel to implement a lazy
mechanism of FPU sharing as described further in this document.

Another special case is a debug agent which could be embedded in the
nanokernel in order to provide host based remote system debugging of the primary
operating system. In this case, the debug agent usually intercepts some
synchronous exceptions related either to debug features (e.g., single instruction
trace) or to program errors (e.g., page fault) as described above in more general
terms.

Forwarded Interrupts

When an interrupt occurs while a secondary operating system is running on the
processor, the interrupt is forwarded to the primary operating system. Such an
interrupt forwarding process goes through the following major steps:

• the interrupt is intercepted by the nanokernel;
• execution of the preempted secondary kernel is suspended and the nanokernel
  switches to the primary execution environment;
• the nanokernel triggers the corresponding interrupt to the primary kernel
  using an int instruction.

In such a way the corresponding primary low-level interrupt handler is invoked
(in the primary execution environment) in order to process the interrupt. Once
the interrupt is processed, the primary kernel returns to the nanokernel by
executing an iret instruction.

After returning from the primary interrupt handler, the nanokernel calls the
scheduler in order to determine the next secondary operating system to run. Note
that the preempted secondary system would not necessarily be continued after the
interrupt. Another (higher priority) secondary system may become ready to run
because of the interrupt.
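
The scheduling decision taken after the primary handler returns can be sketched
as follows. The selection policy is not specified in detail here, so the sketch
makes explicit assumptions: each secondary kernel has a numeric priority (a
larger value meaning higher priority) and a ready flag, and the scheduler picks
the ready kernel with the highest priority. The names secondary and pick_next
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scheduling view of a secondary kernel. */
typedef struct {
    int  id;        /* kernel identifier */
    int  priority;  /* assumption: larger value means higher priority */
    bool ready;     /* true when the kernel has work to do */
} secondary;

/* Return the index of the secondary kernel to switch to, or -1 when all
 * secondary kernels are idle (control then stays with the primary kernel). */
static int pick_next(const secondary *s, size_t n)
{
    assert(s != NULL || n == 0);

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (s[i].ready && (best < 0 || s[i].priority > s[best].priority))
            best = (int)i;
    }
    return best;
}
```

If an interrupt handled by the primary kernel makes a higher priority secondary
kernel ready, the next call selects that kernel instead of the one that was
preempted.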


Secondary Execution Environment 

Basically, the secondary kernel execution environment is quite close to the
native one except for the interrupts management. The nanokernel environment
modifies the native mechanism of interrupts management in order to make a
secondary operating system fully preemptable. A secondary kernel ported to the
nanokernel architecture no longer disables interrupts at processor level but
rather uses a software interrupts masking mechanism provided by the nanokernel
(i.e., virtual exceptions). Interrupts are no longer directly processed by such a
secondary kernel; rather, they are intercepted by the nanokernel, forwarded to
the primary kernel and only then optionally processed by the secondary kernel in
a deferred way.

Initialization

The nanokernel installs the secondary memory banks at initialization time
together with the primary banks. On the other hand, the final initialization of a
secondary kernel, in particular the kernel context setup, is deferred until the
post initialization phase.

At this phase, the nanokernel allocates memory to keep a copy of the secondary
memory banks. Such a copy is then used to restore the initial image of the
secondary system at restart time. The secondary system restart is however
optional and it might be disabled in order to reduce the physical memory
consumption.

Analogous to the primary kernel, the nanokernel initializes the real GDT and
IDT as well as the initial TSS located in the hidden part of the kernel context
(see Figure 13).

Similarly to the primary real GDT, the initial secondary real GDT has two valid
entries specifying the nanokernel code and data segments. Segment selectors for the
nanokernel code and data are assigned by the nanokernel interface to 0x10 and 0x18
respectively. In addition, the secondary real GDT contains descriptors specifying the
nanokernel TSS data structures used by the nanokernel tasks. Such nanokernel tasks
are used to intercept secondary exceptions as described in the next section. The
nanokernel TSS descriptors are located at the end of the real GDT.
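
The selector values 0x10 and 0x18 can be related to GDT entries by decoding the standard IA-32 selector layout. The following is only an illustrative sketch: the type, function and macro names are hypothetical and do not come from the nanokernel interface.

```c
#include <stdint.h>

/* Fields of an IA-32 segment selector (layout fixed by the architecture). */
typedef struct {
    unsigned index; /* descriptor index within the table (bits 15..3) */
    unsigned ti;    /* table indicator: 0 = GDT, 1 = LDT (bit 2)      */
    unsigned rpl;   /* requested privilege level (bits 1..0)          */
} selector_fields;

static selector_fields decode_selector(uint16_t sel)
{
    selector_fields f;
    f.index = (unsigned)sel >> 3;
    f.ti    = ((unsigned)sel >> 2) & 1u;
    f.rpl   = (unsigned)sel & 3u;
    return f;
}

/* Selector values stated in the text; the macro names are hypothetical. */
#define NK_CODE_SELECTOR 0x10u
#define NK_DATA_SELECTOR 0x18u
```

Decoded this way, the two selectors refer to GDT entries 2 and 3 at privilege level 0.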

In the real IDT, the nanokernel installs task gates in order to intercept hardware
interrupts and nanokernel traps. In order to be able to handle a fatal exception at
secondary initialization time, all other exceptions are also temporarily intercepted by
the nanokernel until a native IDT is installed via the lidt nanokernel trap. If such an
outstanding (fatal) exception occurs, the nanokernel simply halts the secondary kernel,
but it disturbs neither the primary nor other secondary systems. Once a native IDT is
installed, the initially used fatal exception gates are overridden by the native ones.
Note, however, that this does not concern the permanently intercepted exceptions
described in the next section.

The nanokernel launches a secondary kernel by executing a task switch to the
initial TSS located in the secondary kernel context. Figure 14 shows how the initial
TSS is initialized prior to the task switch. Note that only non-zero fields are shown
in the figure, while all zero fields are shaded.

As for the primary kernel, the kernel context physical address is passed
in the %esi register. On the other hand, unlike for the primary kernel, the interrupt
flag (IF) is set in the processor flags field (EFLAGS), enabling processor interrupts
even during the secondary kernel initialization phase. It should be noted that even the
secondary kernel initialization code is fully preemptable by the primary system. This is
particularly important in order not to disturb the primary operating system when a
secondary operating system is restarted.

Although hardware interrupts are enabled, the virtual exceptions (corresponding to
hardware interrupts) are disabled when a secondary kernel is started. So, interrupts are
not delivered by the nanokernel until they are explicitly enabled by the kernel at the
end of the critical initialization phase. The software interrupt masking mechanism
(based on virtual exceptions) is described in detail further in this document.

The CR3 field points to a one-to-one translation tree. Such an initial one-to-one
mapping is temporarily provided to a secondary kernel. Note that this mapping should
not be modified or permanently used by the initialization code; instead, the secondary
kernel should build its own KV mapping and switch to it as soon as possible.

The stack pointer is invalid when a secondary kernel is started. Usually, the
secondary kernel uses a static initial stack located in the data section in order to
execute its initialization code.

As for the primary kernel, during the initialization phase, a secondary
kernel typically invokes the nanokernel traps in order to set up the GDT, IDT and task
registers. Finally, the secondary kernel enters the idle loop and invokes the
nanokernel idle trap.

Intercepted Exceptions 

In order to intercept a secondary exception, the nanokernel installs a task gate
in the corresponding entry of the real IDT. Thus, when such an exception occurs, the
IA-32 Intel processor performs a task switch which saves the processor state to the
current task state segment (TSS) and restores the processor state from the TSS
specified by the exception task gate.
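
As an illustration of what installing a task gate involves, the sketch below encodes an IA-32 task-gate descriptor as a 64-bit value. The bit layout is the architectural one; the function name is hypothetical and the selector value used in the note afterwards is an arbitrary example, not one taken from the nanokernel.

```c
#include <stdint.h>

/* Build an 8-byte IA-32 task-gate descriptor.
 * Bits 16..31 hold the TSS segment selector; bits 40..47 hold the
 * present bit (0x80), the DPL (bits 5..6) and the gate type 0x5.
 * All other bits are reserved and left zero. */
static uint64_t make_task_gate(uint16_t tss_selector, unsigned dpl)
{
    uint64_t access = 0x80u | ((dpl & 3u) << 5) | 0x05u;
    return ((uint64_t)tss_selector << 16) | (access << 40);
}
```

For example, a gate referring to a TSS descriptor at selector 0x28 with DPL 0 encodes as 0x0000850000280000.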

For each intercepted exception, the nanokernel creates a dedicated TSS data
structure referenced by a dedicated segment descriptor located in the real GDT. Such
nanokernel segments (used to reference the nanokernel TSS data structures) are located
at the end of the real GDT. All nanokernel TSS data structures are initialized in the
same way, according to the nanokernel execution environment. Figure 15 shows the
non-zero fields of a nanokernel TSS. The zeroed part of the TSS data structure is
shaded in the figure.

The EIP field contains the address of a nanokernel exception task. The EBX field
points to an exception descriptor. Note that multiple intercepted exceptions can be
multiplexed onto the same nanokernel exception task. For example, the same task is used
to intercept all hardware interrupts. In this case, such a multiplexed task can use the
exception descriptor (available in the %ebx register) in order to obtain
exception-specific information.

All intercepted exceptions can be classified according to their nature as
interrupts, traps and programming exceptions (faults).

The nanokernel intercepts all hardware interrupts (including the non-maskable
interrupt (NMI)) in order to forward them to the primary kernel.

Traps intercepted by the nanokernel are, in fact, the following nanokernel
invocations:

• generic trap
• XIRQ trap
• STI trap
• IRET trap

The generic trap combines all non-performance-critical nanokernel invocations
such as console I/O, lgdt/sgdt, lidt/sidt, ltr, halt, reboot and restart. The nanokernel
method number and arguments are passed in general purpose registers, as for a
conventional trap. The generic trap is handled by a common exception task which invokes
nanokernel methods according to the number coded in the %eax register.

The other three traps are performance critical and they are handled by specific
nanokernel tasks. These traps have no arguments.
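
The generic trap dispatch described above can be pictured as a table lookup keyed on the method number in %eax. The sketch below is a minimal illustration only: the register image, the method numbers and the handler names are hypothetical, since the text does not give the actual numbering.

```c
#include <stddef.h>
#include <stdint.h>

/* Register image seen by the trap task (hypothetical layout). */
typedef struct { uint32_t eax, ebx, ecx, edx; } nk_regs;

typedef int (*nk_method)(nk_regs *regs);

/* Hypothetical method numbers; the real ones are not given in the text. */
enum { NK_CONSOLE_WRITE, NK_HALT, NK_METHOD_COUNT };

static int last_method = -1;  /* records which stub ran, for illustration */

static int nk_console_write(nk_regs *r) { (void)r; last_method = NK_CONSOLE_WRITE; return 0; }
static int nk_halt(nk_regs *r)          { (void)r; last_method = NK_HALT; return 0; }

static const nk_method nk_methods[NK_METHOD_COUNT] = {
    [NK_CONSOLE_WRITE] = nk_console_write,
    [NK_HALT]          = nk_halt,
};

/* Dispatch on the method number passed in %eax; -1 marks an unknown number. */
static int nk_generic_trap(nk_regs *regs)
{
    if (regs->eax >= (uint32_t)NK_METHOD_COUNT || nk_methods[regs->eax] == NULL)
        return -1;
    return nk_methods[regs->eax](regs);
}
```

A single table keeps the common exception task small, which suits invocations that are not performance critical.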

The XIRQ trap sends a cross interrupt to the primary kernel. The XIRQ trap
task is equivalent to the interrupt task except that the exception forwarded to the
primary kernel corresponds to a software interrupt rather than to a hardware one. So,
like an interrupt, the XIRQ trap preempts the current secondary kernel.

The STI and IRET traps are both called by a secondary kernel in order to process
pending virtual exceptions. These traps take part in the software interrupt masking
mechanism and they are described in detail in the next section, dedicated to the virtual
exceptions.

As for the primary kernel, the nanokernel usually does not intercept
programming exceptions, except in some special cases described below.

The nanokernel intercepts the following exceptions related to the FPU
emulation:

• the invalid opcode exception (#UD)
• the device not available exception (#NM)

The FPU emulation feature is used by the nanokernel to implement a lazy
mechanism of FPU sharing as described further in this document.

Another special case is a debug agent which could be embedded in the
nanokernel in order to provide host-based remote system debugging of the secondary
operating system. In this case, the debug agent usually intercepts some synchronous
exceptions related either to debug features (e.g., single instruction trace) or to
program errors (e.g., page fault). The design of such a debug agent is, however, out of
the scope of this document.

Virtual Exceptions 

Virtual exceptions (VEX) are a mechanism provided by the nanokernel which
allows a kernel to post an exception to a secondary kernel and to have it delivered in
a deferred manner. In particular, the VEX mechanism is used in the IA-32 Intel
nanokernel architecture in order to replace hardware interrupts with software ones for
a secondary kernel.

The VEX interface consists of two fields located in the kernel context: pending
and enabled. These fields are meaningful only for a secondary kernel context, but they
are accessed by both the primary and secondary kernels. All virtual exceptions are
naturally enumerated by bit position in the pending (or enabled) field. So, in total,
32 virtual exceptions are supported by the nanokernel on the IA-32 Intel
architecture (the pending and enabled fields are 32-bit integer values).



The table below shows how the virtual exceptions are mapped to the real ones:

Virtual Exception    Real Exception    Description
0                    2                 NMI
1-16                 32-47             IRQ0-IRQ15
17                   48                Cross Interrupt
18-30                -                 -
31                   -                 Running

Virtual exceptions from 0 up to 16 are mapped to the hardware interrupts. The
virtual exception 17 is mapped to the real exception 48, used to deliver cross
interrupts to the secondary kernel. The virtual exceptions from 18 up to 30 are not
currently used and they are reserved for future extensions. The virtual exception 31
does not correspond to any real exception; it is in fact a pseudo virtual exception
which is used internally by the nanokernel in order to detect whether the kernel is
idle. How such a pseudo virtual exception works is described in detail further in this
document.
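
The mapping in the table can be written as a small lookup function. This is a sketch only: the macro and function names are hypothetical, and the vector numbers are those listed in the table.

```c
#include <stdint.h>

#define VEX_NMI      0u   /* non-maskable interrupt                       */
#define VEX_IRQ0     1u   /* IRQ0..IRQ15 occupy virtual exceptions 1..16  */
#define VEX_XIRQ     17u  /* cross interrupt                              */
#define VEX_RUNNING  31u  /* pseudo exception without a real counterpart  */

/* Return the real exception vector for a virtual exception, or -1 when
 * there is none (reserved range 18..30 and the Running pseudo exception). */
static int vex_to_real(unsigned vex)
{
    if (vex == VEX_NMI)
        return 2;
    if (vex >= VEX_IRQ0 && vex <= 16u)
        return 32 + (int)(vex - VEX_IRQ0);
    if (vex == VEX_XIRQ)
        return 48;
    return -1;
}
```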

Because multiple virtual exceptions can be pending at the same time but only
one of them can be processed at a time, all virtual exceptions are prioritized
according to their number. The highest priority is assigned to the NMI and the lowest
priority is assigned to the Running pseudo exception.

The pending VEX field of a secondary context is typically updated by the
primary kernel, which provides a driver for the virtual PIC device. Such a driver
usually posts virtual exceptions (interrupts) to secondary kernels by setting the
appropriate bits in the pending VEX field.

The enabled VEX field is updated by the secondary kernel in order to enable or
disable virtual exceptions. A given virtual exception is enabled if the corresponding
bit is set in the enabled VEX field. Using the enabled VEX field, a secondary kernel
implements critical sections protected against interrupts. In other words, a secondary
kernel no longer uses the cli and sti IA-32 instructions to disable/enable processor
interrupts but rather modifies the enabled field of its kernel context.

A given virtual exception is delivered by the nanokernel if it is pending and
enabled simultaneously. The nanokernel resets the corresponding pending bit just
before jumping to the secondary exception handler.
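
The delivery rule (pending and enabled, highest priority first, pending bit cleared before the handler runs) can be sketched as follows. The structure is a hypothetical stand-in for the two fields of the kernel context, and the sketch makes two assumptions: it skips the Running pseudo exception (bit 31), which has no handler of its own, and it ignores the atomic updates a real implementation would need because the primary kernel writes the pending field concurrently.

```c
#include <stdint.h>

/* The two VEX fields of a secondary kernel context (hypothetical struct). */
typedef struct {
    uint32_t pending;
    uint32_t enabled;
} vex_fields;

/* Select the highest-priority deliverable exception (lowest bit number),
 * clear its pending bit and return its number; return -1 if there is none.
 * Bit 31 (Running) is deliberately not scanned. */
static int vex_take_next(vex_fields *v)
{
    uint32_t ready = v->pending & v->enabled;
    for (int n = 0; n < 31; n++) {
        if (ready & (UINT32_C(1) << n)) {
            v->pending &= ~(UINT32_C(1) << n);
            return n;
        }
    }
    return -1;
}
```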

When delivering a virtual exception to a secondary kernel, the nanokernel
interprets the gate descriptor from the native IDT. In order to minimize modifications
in low-level handlers of a secondary kernel, the nanokernel calls the gate handler in
the same state as the IA-32 Intel processor does. In other words, the nanokernel
switches the stack pointer, the code and stack segments and pushes the exception frame
onto the supervisor stack in the same way as the IA-32 Intel hardware does.

Note, however, that when porting a secondary kernel to the IA-32 nanokernel
architecture, low-level exception handlers still have to be modified in order to take
into account the software interrupt masking mechanism which replaces the hardware
one. When calling an interrupt gate handler, the nanokernel only disables all virtual
exceptions by writing 0x80000000 to the enabled field. The hardware interrupts are
always enabled at processor level when running a secondary kernel and therefore a
secondary kernel can be preempted by the primary one even inside a low-level
interrupt gate handler. In this way, in the nanokernel environment, a secondary
operating system becomes fully preemptable by the primary operating system.

A virtual exception can be posted by the primary kernel while it is in the disabled
state. In this case, the exception is not delivered to the secondary kernel but is
rather kept pending until the exception is re-enabled. So, when virtual exceptions are
re-enabled by a secondary kernel, a check should be made whether any virtual
exceptions are pending. If the check is positive, the secondary kernel should invoke
the nanokernel in order to process such pending virtual exceptions. Such an invocation
is performed by means of either the STI or the IRET trap.

In general, a secondary kernel re-enables virtual exceptions in the following two
cases:

• when virtual exceptions have been previously disabled by the secondary
kernel in order to protect a critical section of code;

• when virtual exceptions have been disabled by the nanokernel as a result of
an interrupt gate invocation.

In the former case, the secondary kernel uses the STI trap to process pending
virtual exceptions, if any. Once pending exceptions are processed, the nanokernel will
return from the STI trap in order to continue the secondary kernel execution.

In the latter case, the secondary kernel uses the IRET trap to process pending
virtual exceptions when returning from the exception handler. Note that the IRET trap
simply replaces the iret IA-32 instruction, and when the trap is executed the exception
frame is still on the supervisor stack. Note also that the nanokernel does not
return from the IRET trap; instead, once pending exceptions are processed, the
secondary operating system execution is continued at the point at which it was
preempted by the initial virtual exception. In other words, the IRET trap returns to
the state saved in the exception frame located at the top of the stack at trap time.
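
The first case above (a critical section in the secondary kernel) might look as follows in a ported kernel. This is a hedged sketch: the structure, the function names and the stubbed `nk_sti_trap` are hypothetical, and a real port would execute the actual trap instruction rather than count calls.

```c
#include <stdint.h>

#define VEX_RUNNING_BIT UINT32_C(0x80000000)  /* bit 31, never cleared by the kernel */

typedef struct {
    volatile uint32_t pending;  /* set by the primary kernel        */
    volatile uint32_t enabled;  /* updated by the secondary kernel  */
} vex_fields;

/* Stand-in for the STI trap; it only counts invocations here. */
static int sti_trap_calls;
static void nk_sti_trap(void) { sti_trap_calls++; }

/* Enter a critical section: mask everything except the running bit.
 * Returns the previous mask so that it can be restored later. */
static uint32_t vex_disable(vex_fields *v)
{
    uint32_t old = v->enabled;
    v->enabled = VEX_RUNNING_BIT;
    return old;
}

/* Leave a critical section: restore the mask, then ask the nanokernel to
 * deliver anything that was posted while exceptions were masked. */
static void vex_restore(vex_fields *v, uint32_t mask)
{
    v->enabled = mask | VEX_RUNNING_BIT;
    if (v->pending & v->enabled & ~VEX_RUNNING_BIT)
        nk_sti_trap();
}
```

The trap is taken only when something is actually pending, so an uncontended critical section costs two memory writes and one test.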

Nanokernel Re-Entrance

The nanokernel code is mostly executed with interrupts disabled at processor
level, preventing re-entrance into the nanokernel. On the other hand, some nanokernel
invocations may take a long time and therefore the nanokernel has to enable interrupts
when executing such long operations in order to keep the primary interrupt latency low.
There are three kinds of long nanokernel operations:

• synchronous console output
The operation duration depends on the serial line speed. For example, on a
9600 baud line, a single character output may take up to 1 millisecond.

• lgdt and lidt
The operation duration depends on the table size. These operations can still be
done with interrupts disabled for the primary kernel, because they are typically issued
at initialization time when interrupts are usually disabled anyway. For the secondary
lgdt and lidt methods, however, it is clearly not acceptable to keep interrupts disabled,
because a secondary kernel can be restarted at any time.

• secondary kernel restart
The operation duration depends on the size of the kernel image, which is restored
from a copy.

For all operations listed above, the nanokernel enables interrupts and therefore
re-entrance from the primary kernel. On the other hand, while interrupts are enabled,
the nanokernel scheduler is disabled in order to prevent another secondary kernel from
being scheduled when returning from the primary interrupt handler. In other words, the
nanokernel can be preempted by the primary kernel only (as a result of an interrupt),
but re-entrance from a secondary kernel is prohibited. Such a restriction allows the
nanokernel to use global resources for the secondary execution environment. For
example, TSS data structures used for intercepted exceptions are global and shared
across all secondary kernels.

Some long operations issued from a secondary kernel can be executed in the
primary memory context. In other words, before executing such an operation, the
nanokernel switches to the primary execution context and then enables interrupts.
Once the operation is done, the nanokernel disables interrupts and returns to the
calling secondary kernel through the nanokernel scheduler.

Note, however, that some long operations cannot be executed in the primary
memory context because they require access to data structures located in the secondary
KV mapping. Typical examples of such operations are the secondary lgdt and lidt
methods, which access the native GDT and IDT data structures respectively. So, these
operations must be done in the nanokernel memory context.

Note also that it is preferable to execute frequently used nanokernel methods in
the nanokernel memory context (even if they can be executed in the primary memory
context as well) in order to avoid the extra overhead introduced by the switch to/from
the primary execution environment. A typical example of such a frequent operation is
a synchronous output on the nanokernel console.

The discussion above shows that the nanokernel must be capable of enabling
processor interrupts while executing code in the nanokernel memory context with the
secondary GDT and IDT activated. In other words, the nanokernel must support a task
switch to the interrupt task while running the trap task executing a secondary
nanokernel method.

In order to support such task nesting, the nanokernel maintains a per-kernel
TSS stack located in the hidden part of the kernel context, as shown in Figure 13. This
data structure is meaningful only for a secondary kernel context. The top of the stack
points to the current TSS, i.e., to the TSS referenced by the task register. The stack
is updated each time a task switch is performed to/from the secondary kernel. When a
nanokernel task is activated by a task gate, the task TSS is pushed onto the stack. When
the nanokernel returns from a nanokernel task, the top TSS is removed from the stack.
The TSS stack is also updated by the secondary ltr method, which changes the native
secondary TSS located at the stack bottom.

Figure 16 shows typical states of a TSS stack. The top half of the figure depicts
the TSS stack evolution when a secondary kernel is preempted by an interrupt. The
bottom half of the figure depicts the TSS stack evolution when the nanokernel
executing a long secondary operation is preempted by an interrupt. Note that the TSS
stack is never empty and the maximum stack depth is three.
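
The TSS stack behaviour described above (never empty, at most three entries, bottom entry replaced by ltr) can be modelled with a small fixed-size stack. This is a simplified sketch under stated assumptions: each entry is represented by a TSS segment selector rather than a pointer, and all names are hypothetical.

```c
#include <stdint.h>

#define TSS_STACK_DEPTH 3  /* native TSS, trap task TSS, interrupt task TSS */

typedef struct {
    uint16_t slot[TSS_STACK_DEPTH]; /* TSS selectors (a simplification) */
    int depth;                      /* always >= 1: slot[0] is the native
                                       (or initial) TSS */
} tss_stack;

static void tss_stack_init(tss_stack *s, uint16_t initial_tss)
{
    s->slot[0] = initial_tss;
    s->depth = 1;
}

/* A nanokernel task activated by a task gate pushes its own TSS. */
static int tss_stack_push(tss_stack *s, uint16_t tss)
{
    if (s->depth == TSS_STACK_DEPTH)
        return -1;
    s->slot[s->depth++] = tss;
    return 0;
}

/* Returning from a nanokernel task removes the top TSS; the bottom stays. */
static int tss_stack_pop(tss_stack *s)
{
    if (s->depth <= 1)
        return -1;
    s->depth--;
    return 0;
}

/* The ltr method replaces the native TSS at the bottom of the stack. */
static void tss_stack_set_native(tss_stack *s, uint16_t native_tss)
{
    s->slot[0] = native_tss;
}

/* The top entry corresponds to the TSS referenced by the task register. */
static uint16_t tss_stack_top(const tss_stack *s)
{
    return s->slot[s->depth - 1];
}
```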

A native TSS is always located at the stack bottom. This TSS is used by the
native secondary kernel execution environment. As described above, a secondary
kernel is started using the initial TSS located in the hidden part of the kernel
context. During the initialization phase, a secondary kernel typically installs its own
native TSS using the ltr nanokernel method. Such a native TSS overrides the initial one
in the TSS stack.

Once a secondary kernel is preempted by an interrupt, or a nanokernel method
is invoked via a trap, the corresponding nanokernel task is activated by the task gate.
The nanokernel task always pushes a pointer to its own TSS onto the stack. In the case
of an interrupt, the nanokernel interrupt task simply switches to the primary kernel in
order to process the interrupt. In the case of a trap, the nanokernel may enable
interrupts while executing the code of a long method.

When interrupts are enabled, a nanokernel method can be preempted by the
nanokernel interrupt task activated by a task gate, and the interrupt task, in turn,
switches to the primary kernel in order to process the interrupt. Once the interrupt is
processed, the primary kernel returns to the nanokernel in order to resume execution of
the interrupted method.

Because the trap and interrupt tasks can be nested, it is necessary to use
different (non-overlapping) stacks when executing the task code. The nanokernel uses a
special interrupt stack in the interrupt task.

When an interrupt occurs in the trap task, the processor switches to the interrupt
task and saves the general-purpose registers into the trap task TSS (which is the
current TSS at this moment). Thus, once interrupts are disabled again in the trap task,
it is necessary to re-initialize the EIP, EBX and ESP fields of the trap TSS because
they could be corrupted by an interrupt.

Scheduler 

The main role of an operating system scheduler is to choose the next task to
run. Because the nanokernel controls the execution of operating systems, the nanokernel
scheduler chooses the next secondary operating system to run. In other words, the
nanokernel adds an extra scheduling level to the whole system.

Note that, in the nanokernel architecture, the primary operating system has a
higher priority level than the secondary systems and the CPU is given to a
secondary system only when the primary one is in the idle loop. We can say that the
primary kernel is not preemptable and that it explicitly invokes the nanokernel
scheduler through the idle method called in the idle loop. When an interrupt occurs
while a secondary system is running, the primary kernel interrupt handler is invoked.
From the primary kernel perspective, such an interrupt preempts the background thread
executing the idle loop. Once the interrupt is handled and all related tasks are done,
the primary kernel returns to the nanokernel, which invokes the nanokernel scheduler in
order to determine the next secondary system to run. From the primary perspective, the
kernel just returns to the background thread preempted by the interrupt. The secondary
activity is transparent to the primary kernel and it does not change the primary
system behavior.



The nanokernel may implement different scheduling policies. By default,
however, a priority-based algorithm is used. Note that, at the same priority level, the
nanokernel uses a round-robin scheduling policy. The priority of a given secondary
kernel is statically configured at system image build time.

Whatever scheduling policy is implemented, the scheduler has to detect
whether a given secondary system is ready to run. This condition is calculated as the
bitwise logical AND of the pending VEX and enabled VEX fields of the
kernel context. A non-zero result indicates that the system is ready to run.

As described above, each bit in the pending VEX and enabled VEX pair
represents a virtual exception. Rephrasing the ready-to-run criterion, we can say that
a secondary system is in the ready-to-run state if there is at least one non-masked
pending virtual exception.
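
The ready-to-run test and the default policy (priority based, round-robin within a priority level) can be sketched as below. The structure and function names are hypothetical, and the sketch assumes that a larger priority value means a more urgent kernel, which the text does not specify.

```c
#include <stdint.h>

typedef struct {
    uint32_t pending;   /* pending VEX field                               */
    uint32_t enabled;   /* enabled VEX field                               */
    int priority;       /* static priority; larger = more urgent (assumed) */
} nk_kernel;

/* Ready-to-run test from the text: at least one pending, enabled exception. */
static int nk_ready(const nk_kernel *k)
{
    return (k->pending & k->enabled) != 0;
}

/* Choose the ready kernel with the highest priority. Ties are broken
 * round-robin by scanning from the slot after the kernel that ran last
 * (pass last = -1 when nothing has run yet). count must be positive.
 * Returns the index of the chosen kernel, or -1 if none is ready. */
static int nk_pick_next(const nk_kernel *k, int count, int last)
{
    int best = -1;
    for (int i = 1; i <= count; i++) {
        int idx = (last + i) % count;
        if (nk_ready(&k[idx]) &&
            (best < 0 || k[idx].priority > k[best].priority))
            best = idx;
    }
    return best;
}
```

Because the comparison is strict, the first ready kernel found after the last one to run wins among equals, which gives the round-robin behaviour.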

Among all virtual exceptions, which are typically mapped to the hardware and
software (cross) interrupts, there is a special virtual exception (running) reflecting
whether the kernel is currently idle.

The running bit is cleared in the pending VEX field each time a secondary
kernel invokes the idle method, and the running bit is set in the pending VEX field each
time a virtual exception is delivered to the secondary kernel.

The running bit is normally always set in the enabled VEX field for a running
secondary kernel. The nanokernel sets this bit when a secondary kernel is started and it
resets this bit when a secondary kernel is halted. The secondary kernel should never
clear the running bit when masking/unmasking interrupts mapped to virtual
exceptions.

Note that an external agent is able to suspend/resume execution of a secondary
kernel by clearing/restoring the enabled VEX field in its kernel context. This feature
makes it possible for a scheduling policy agent to be implemented outside of the
nanokernel, as a primary kernel task. In addition, this also enables a debug agent for a
secondary kernel to run as a task on top of the primary kernel. An advantage of
such a secondary debug agent is that all services provided by the primary operating
system become available for debugging (e.g., the networking stack) and the secondary
kernel debugging may be done concurrently with critical tasks running on the primary
operating system.

Cross Interrupts

This section mostly consolidates information (already given in previous
sections) related to the nanokernel cross interrupt mechanism.

The following two kinds of cross interrupts are considered here:
• a cross interrupt sent to a secondary kernel
• a cross interrupt sent to the primary kernel

In order to send a cross interrupt to a destination secondary kernel, a source
kernel first sets a bit corresponding to the cross interrupt source in the pending XIRQ
field of the destination kernel context. Then the source kernel posts the cross
interrupt VEX to the destination kernel by setting the corresponding bit in the pending
VEX field of the destination kernel context. Once the cross interrupt handler is called
by the nanokernel, it checks the pending XIRQ field, clears the bit corresponding to the
pending cross interrupt source and finally invokes the handlers attached to this source.
Both source and destination kernels use atomic instructions to update the pending XIRQ
field. Note that the same algorithm is used by both types of source kernel: primary and
secondary.
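
The sender and receiver sides of this protocol can be sketched with C11 atomics. The context structure and function names are hypothetical; the sketch assumes a 32-bit pending XIRQ field with one bit per source, and on the receiver side it claims all pending sources in a single exchange instead of clearing them one at a time, which is a simplification of the handler described above.

```c
#include <stdatomic.h>
#include <stdint.h>

#define VEX_XIRQ_BIT (UINT32_C(1) << 17)  /* virtual exception 17 */

typedef struct {
    _Atomic uint32_t pending_xirq;  /* one bit per cross interrupt source */
    _Atomic uint32_t pending_vex;   /* pending VEX field                  */
} nk_context;

/* Sender side: mark the source, then post the cross interrupt VEX.
 * Source 0 is reserved for the nanokernel, so it is rejected here. */
static int xirq_post(nk_context *dst, unsigned source)
{
    if (source == 0 || source > 31)
        return -1;
    atomic_fetch_or(&dst->pending_xirq, UINT32_C(1) << source);
    atomic_fetch_or(&dst->pending_vex, VEX_XIRQ_BIT);
    return 0;
}

/* Receiver side: atomically claim all pending sources and return them as
 * a bit mask; the caller then runs the handlers attached to each source. */
static uint32_t xirq_collect(nk_context *self)
{
    return atomic_exchange(&self->pending_xirq, 0);
}
```

Setting the source bit before posting the VEX ensures that the handler never runs without finding the source that triggered it.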

In order to send a cross interrupt to the primary kernel, a secondary kernel first
sets a bit corresponding to the cross interrupt source in the pending XIRQ field of the
primary kernel context. Then the secondary kernel invokes the nanokernel by executing
the XIRQ trap. The nanokernel immediately preempts the secondary kernel and
invokes the primary low-level cross interrupt handler, which checks the pending XIRQ
field, clears the bit corresponding to the pending cross interrupt source and finally
invokes the handlers attached to this source.

The cross interrupt zero must not be used by kernels. This interrupt is reserved
for the nanokernel to notify kernels that a halted kernel has been started or a running
kernel has been halted. In other words, the cross interrupt zero notifies running
kernels that the global system configuration has changed. It is broadcast to all
running kernels each time the state of the running field is changed in a kernel context.

FPU Management

The FPU engine is a computing resource which is typically shared by all
operating systems running in the nanokernel environment.

On the IA-32 Intel architecture, the nanokernel manages FPU sharing in a lazy
manner. This means that when a switch from one operating system to another occurs,
the FPU engine is not immediately given to the newly scheduled operating system;
instead, the FPU switch is deferred until the newly scheduled system actually
executes floating-point instructions and accesses floating-point registers.

Such a lazy FPU dispatching algorithm allows the nanokernel to reduce the
system switch time. This is especially important in order to reduce the primary
interrupt latency, because the FPU is normally not used at interrupt level and
therefore it is usually not necessary to save and restore FPU registers in order to
preempt a secondary operating system and to call a primary interrupt handler.

The nanokernel maintains an FPU owner global variable pointing to the context
of the kernel which currently uses the FPU. When there is no FPU owner, the FPU owner
variable is set to zero. An FPU context is located in the hidden part of the kernel
context. Such a context keeps the state of the FPU engine (i.e., floating-point
registers and status) when the kernel is not the FPU owner. Obviously, the state of the
FPU owner is kept by the FPU engine hardware. When the nanokernel changes the FPU owner,
the FPU state is saved to the old FPU context and restored from the new one.

The nanokernel uses the emulation bit (EM) of the CR0 register in order to
provoke an exception when the FPU is used by a non FPU owner. The CR0 register image
is held in the hidden part of the kernel context. The CR0 register is saved (to the
old context) and restored (from the new context) at system switch. The EM bit is set in
all kernel contexts except that of the FPU owner, where it is cleared. In addition, the
nanokernel intercepts the invalid opcode (#UD) and the device not available (#NM)
exceptions for all non FPU owners, while the FPU owner handles these exceptions in a
native way.



An FPU switch occurs when the nanokernel intercepts one of the FPU related
exceptions: #UD or #NM. In order to switch the FPU engine between two kernels, the
nanokernel releases the current FPU owner and assigns the new one.

In order to release the current FPU owner, the nanokernel saves the current
FPU state in the kernel context and sets the EM bit in the CR0 register image. In
addition, the nanokernel gates are installed in the real IDT in order to intercept the
#UD and #NM exceptions.

In order to assign a new FPU owner, the nanokernel restores the FPU state
from the kernel context and clears the EM bit in the CR0 image. In addition, the native
gates are installed in the real IDT in order to handle the #UD and #NM exceptions in a
native way while owning the FPU.
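
The release and assign steps above can be summarized in a short sketch. It is illustrative only: the context structure and names are hypothetical, the save and restore routines are stubs that merely count calls (real code would use the FPU save/restore instructions), the 512-byte save area size is an assumption, and the swapping of IDT gates is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define CR0_EM UINT32_C(0x4)   /* CR0 bit 2: emulation */

typedef struct nk_context {
    uint32_t cr0;            /* saved CR0 register image             */
    uint8_t  fpu_area[512];  /* saved FPU state (size is an assumption) */
} nk_context;

static nk_context *fpu_owner;  /* NULL when no kernel owns the FPU */

/* Stand-ins for the hardware save/restore operations. */
static int saves, restores;
static void fpu_save(nk_context *k)    { (void)k; saves++; }
static void fpu_restore(nk_context *k) { (void)k; restores++; }

/* Called when the nanokernel intercepts #UD or #NM raised by kernel k. */
static void fpu_switch(nk_context *k)
{
    if (fpu_owner == k)
        return;
    if (fpu_owner != NULL) {      /* release the current owner */
        fpu_save(fpu_owner);
        fpu_owner->cr0 |= CR0_EM;
    }
    fpu_restore(k);               /* assign the new owner */
    k->cr0 &= ~CR0_EM;
    fpu_owner = k;
}
```

Since the switch runs only on an intercepted FPU exception, a kernel that never touches the FPU never pays for a save or restore.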

The nanokernel uses the OSFXSR bit of the CR4 register in order to optimize
the saving and restoring operations. The CR4 register image is held in the hidden
part of the kernel context. It is saved (to the old context) and restored (from the new
context) at system switch. The nanokernel uses the CR4 register image in order to
determine which type of FPU context should be saved or restored: standard or
extended. This allows the nanokernel to avoid saving/restoring the extended FPU context
for an operating system which uses neither MMX nor SIMD features.

Because the nanokernel uses the EM bit of the CR0 register in order to
implement a lazy FPU switch, a kernel is not allowed to change the state of this bit. In
particular, this means that FPU emulation is not supported by a kernel ported to the
nanokernel architecture.

Note that an operating system kernel usually uses the TS bit of the CR0 register
in order to implement a lazy FPU switch between processes. Because the CR0 register
image is part of the kernel context and is therefore saved and restored at system
switch, the native FPU management can be kept almost unchanged in the nanokernel
environment.

Note, however, that the TS bit is automatically set by a task switch. This means
that FPU exceptions can occur in a secondary kernel even if the TS bit is logically
cleared from the kernel point of view. Such spurious FPU exceptions are introduced by
the task gates used by the nanokernel in order to intercept secondary exceptions. In
order to detect such spurious FPU exceptions and quietly ignore them (just clearing the
TS bit), a secondary kernel should maintain a software copy of the TS bit. The
nanokernel assists the secondary kernel in this task by providing a field dedicated to
this purpose in the kernel context.
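
A secondary kernel's #NM handler could use such a software copy roughly as follows. This is a sketch under assumptions: the structure, the field name and the stubbed `cpu_clts` are hypothetical, and the native lazy-switch path is left to the caller.

```c
#include <stdint.h>

typedef struct {
    int soft_ts;  /* software copy of CR0.TS maintained by the secondary kernel */
} nk_fpu_hint;

/* Stand-in for the clts instruction; it only counts invocations here. */
static int clts_calls;
static void cpu_clts(void) { clts_calls++; }

/* Returns 1 if the #NM exception was spurious (TS was set only by a
 * nanokernel task switch) and has been dealt with by clearing TS.
 * Returns 0 if the kernel itself set TS, in which case the caller runs
 * its native lazy FPU switch. */
static int nm_is_spurious(const nk_fpu_hint *h)
{
    if (h->soft_ts)
        return 0;
    cpu_clts();
    return 1;
}
```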


Other aspects and embodiments 

It will be clear from the foregoing that the above-described embodiments are
only examples, and that many other embodiments are possible. The operating systems,
platforms and programming techniques mentioned may all be freely varied. Any other
modifications, substitutions and variants which would be apparent to the skilled person
are to be considered within the scope of the invention, whether or not covered by the
claims which follow. For the avoidance of doubt, protection is sought for any and all
novel subject matter and combinations thereof disclosed herein.